A method for extracting features in contents of a document
without using a word dictionary and a system using the
method for accurately searching for a relevant document or
documents at high speed. The method includes steps of
storing character strings present in a text in a text database
and probabilities of their appearing at boundaries of words in
the text in the form of an occurrence probability file, storing
occurrence frequencies of the character strings in the text as
an occurrence frequency file, extracting characteristic strings
from a text specified by a user with use of the occurrence
probability file, and counting occurrence frequencies thereof
in the user-specified text. The method calculates similarities
to the user-specified text with use of the occurrence
frequency file and the occurrence frequencies in the
user-specified text.
FIG. 22

METHOD AND SYSTEM FOR EXTRACTING
CHARACTERISTIC STRING, METHOD AND 
SYSTEM FOR SEARCHING FOR RELEVANT 
DOCUMENT USING THE SAME, STORAGE 
MEDIUM FOR STORING CHARACTERISTIC 
STRING EXTRACTION PROGRAM, AND 
STORAGE MEDIUM FOR STORING 
RELEVANT DOCUMENT SEARCHING 
PROGRAM 

BACKGROUND OF THE INVENTION 

The present invention relates to a method and system for 
extracting a character string indicative of a feature of con- 
tents described in a document, a method and system for 
searching a document database for a document or documents 
having contents similar to those described in a document 
specified by a user with use of the first-mentioned method
and system, and a storage medium for storing a searching 
program therein. 

As the use of personal computers and the Internet spreads,
the number of electronic documents has increased explosively
in recent years, and this growth is expected to accelerate
in the future. In such circumstances, there is a strong
demand for users to be able to search quickly and
efficiently for a document or documents containing the
information they want.

One of techniques for satisfying such a demand is a 
full-text search. In the full-text search, documents to be 
searched are registered as a text in a computer system for 
creation of a database, and the system searches the database 
for a document or documents containing a search character 
string (which will be referred to as a query term, hereinafter)
specified by a user. The full-text search is thus
characterized in that, since the search is carried out on the
character strings themselves in the documents, any word can be
searched for, unlike a prior art keyword searching system based
on previously set keywords.

However, in order to reliably search for a document or
documents containing information desired by the user, the
user must compose a complex search condition expression that
accurately reflects the user's search intention and enter it
into the system. This is a difficult task for ordinary users
who are not experts in information search.

To eliminate this burden, much attention is now focused on
a relevant document searching technique in which the user
presents, as an example, a document (which will be referred
to as a 'seed' document, hereinafter) containing the desired
contents, and the system searches for a document or
documents similar to the seed document.

Disclosed as one of the relevant document searching 
methods is, for example, a technique (which will be referred 
to as the prior art 1, hereinafter) for extracting words 
contained in a seed document through morphological analy- 
sis to search for a relevant document or documents based on 
the extracted words, as in JP-A-8-335222. 

In the prior art 1, words contained in a seed document are
extracted through morphological analysis to search for a
relevant document or documents containing the words. For
example, when the seed document is a document 1 of
" . . . (User's manner when the portable phone is in use
becomes important.) . . . ", words such as "携帯電話 (portable
phone)", "マナー (manner)" and "重要 (important)" are
extracted by looking up a word dictionary through
morphological analysis. As a result, the system can
search for a document 2 of " . . . (Use of portable phones
in trains is banned) . . . " containing "携帯電話" as a
relevant document.

However, the prior art 1, which uses the word dictionary 
for word extraction, has two problems which will be men- 
tioned below. 

The first problem is that, when a word not listed in the
word dictionary indicates the seed document's essential
contents (which will be referred to as central concept,
hereinafter), the document's central concept cannot be
searched for accurately even when similarity searching is
carried out with use of the other words, because the
essential word cannot be extracted as a search word from
the seed document. In other words, when the information
desired by the user is a new word not listed in the word
dictionary, the search undesirably returns a document or
documents having concepts shifted from the target central
concept.

The second problem is that, even when the word desired
by the user is listed in the word dictionary, a document or
documents having concepts shifted from the central concept
may be undesirably searched depending on how the words are
extracted. For example, words such as "携帯電話", "マナー"
and "重要" are extracted from the above document 1.
However, there is undesirably a likelihood that a document
3 of " . . . (I got an advice about how to talk on
phone) . . . " is calculated low in its similarity because
the word "電話" cannot be extracted. This results from the
fact that search words are all extracted from the word
dictionary.

The problems in the prior art 1 have been explained
above.

For the purpose of solving the above problems, there has
been suggested a technique (which will be referred to as the
prior art 2, hereinafter) in Japanese Patent Application No.
9-309078, by which character strings each consisting of n
consecutive characters (which strings will be referred to as
n-grams, hereinafter) are mechanically extracted according to
character type, such as 'Kanji' or 'Katakana', to search for
a relevant document or documents without using any word
dictionary.

In the prior art 2, how to extract the n-gram is changed
according to the character types to extract meaningful
n-grams (which will be referred to as characteristic strings,
hereinafter). For example, 2-grams are mechanically
extracted from a character string of Kanji characters (which
string will be referred to as a Kanji character string,
hereinafter); while from character strings of Katakana
characters (which strings will be referred to as Katakana
character strings, hereinafter), the Katakana character
string having the longest length (which string will be
referred to as a Katakana longest character string,
hereinafter) is itself extracted. In this case,
characteristic strings such as "携帯", "帯電", "電話" and
"マナー" are extracted from the above document 1 as a seed
document. That is, since the character string "電話" is also
extracted without missing, even the document 3 can be
extracted with a correctly calculated similarity.
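
The character-type-dependent extraction described for the prior art 2 can be illustrated with a minimal Python sketch. The Unicode ranges used to recognize Kanji and Katakana, the function name `extract_by_character_type`, and the handling of Kanji strings shorter than n (kept whole) are assumptions made for illustration and are not taken from the cited application.

```python
import re

# Assumed character classes: CJK Unified Ideographs for Kanji and the
# Katakana block (which includes the long-vowel mark) for Katakana.
KANJI = "\u4e00-\u9fff"
KATAKANA = "\u30a0-\u30ff"
RUN = re.compile(f"(?P<kanji>[{KANJI}]+)|(?P<katakana>[{KATAKANA}]+)")


def extract_by_character_type(text, n=2):
    """Return characteristic strings in order of appearance.

    Kanji runs yield every overlapping n-gram; Katakana runs are kept
    whole (the "Katakana longest character string").
    """
    strings = []
    for match in RUN.finditer(text):
        run = match.group()
        if match.lastgroup == "katakana":
            strings.append(run)
        elif len(run) < n:
            # Assumption: a Kanji run shorter than n is kept as it is.
            strings.append(run)
        else:
            strings.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return strings


print(extract_by_character_type("携帯電話のマナー"))
```

Note that the 2-gram "帯電" is produced even though it straddles the two words inside "携帯電話"; this is the kind of across-word extraction that mechanical n-gram extraction cannot avoid.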
In the prior art 2, however, there is a possibility of
extracting even an n-gram across the words of a Kanji
character string that can form a compound word. For this
reason, use of this search method causes calculation of a
similarity of such a document that is not similar to the
seed document in contents, which results in a problem that a
document not associated with the seed document is
undesirably searched.

For example, for the characteristic string of "帯電"
extracted from the document 1 as a seed document, its
similarity is calculated, which undesirably results in
erroneous search of a document 4 of " . . . (In order to
prevent charging, it must be grounded.) . . . " as a
relevant document.

For solving the above problem, there has been suggested
a technique (which will be referred to as the prior art 3,
hereinafter) for extracting a characteristic string using
statistical information of 1-gram, as shown in a Journal of
the Information Processing Society of Japan, pp. 2286 to
2297, Vol. 38, No. 11, November 1997.

In the prior art 3, with respect to each of 1-grams
appearing in a document to be registered, a probability of
the 1-gram forming a head of a word (which probability will
be referred to as a head-position probability, hereinafter)
as well as a probability of the 1-gram forming a tail of a
word (which probability will be referred to as a
tail-position probability, hereinafter) are previously
calculated at the time of registering the document. In this
case, it is assumed that a word consists of a string of a
single type of characters such as Kanji or Katakana (which
string will be referred to as a single character type
string, hereinafter) and is delimited at a character type
boundary such as the boundary between Kanji and Katakana,
and that the 1-gram located directly after the character
type boundary is regarded as a head 1-gram in a word and the
1-gram located directly before the character type boundary
is regarded as a tail 1-gram in a word.

For example, with regard to the Kanji character string
"携帯電話" delimited at a character type boundary and
extracted from the above document 1, "携" is a head 1-gram
in the word and "話" is a tail 1-gram in the word.

For searching for a relevant document or documents, a
single character type string is first extracted from a
specified seed document. Next, a probability of division of
the single character type string between two consecutive
1-grams in the single character type string (which
probability will be referred to as a division probability,
hereinafter) is calculated on the basis of a tail-position
probability of the front one of the two consecutive 1-grams
and a head-position probability of the rear one thereof.
When the value of a calculated division probability exceeds
a predetermined value (which will be referred to as a
division threshold, hereinafter), the system performs
division of the single character type string thereat.

Explanation will be made as to detailed processing
operations of the prior art 3 with a division threshold of
0.050.

First of all, with respect to each of 1-grams appearing in
all documents to be registered, the system counts an
occurrence frequency, the number of times of occurrence at
the heads of words (which will be referred to as the
head-position frequency, hereinafter) and the number of
times of occurrence at the tails of words (which will be
referred to as the tail-position frequency, hereinafter) at
the time of registering the documents and then stores the
counted values in an occurrence information file. In the
case of the above document 1, the occurrence information
obtained for "携" is an occurrence frequency of 1, a
head-position frequency of 1 and a tail-position frequency
of 0. FIG. 2 shows an exemplary occurrence information file.

Thereafter, looking up the above occurrence information
file, the system calculates head and tail probabilities of
each 1-gram and stores them in an occurrence probability
file. For example, a head-position probability of 1-gram
"携" is 768/4,740=0.16 and a tail-position probability of
1-gram "携" is 492/4,740≈0.10. FIG. 3 shows an exemplary
occurrence probability file.

Explanation will next be made as to how to search for a
document or documents in the prior art 3 by referring to a
single character type string of "携帯電話" as an example.

First, three pairs of 1-grams, (携, 帯), (帯, 電) and
(電, 話), are extracted from the single character type
string of "携帯電話". In each 1-gram pair, the system
acquires a tail-position probability of the front one of the
1-grams and a head-position probability of the rear one of
the 1-grams from the occurrence probability file previously
created at the time of the document registration, and
calculates a division probability based on the acquired head
and tail probabilities.

FIG. 4 shows how to calculate division probabilities for
the three 1-gram pairs extracted from "携帯電話". In this
example, the division probabilities of (携, 帯), (帯, 電)
and (電, 話) are calculated as 0.011, 0.054 and 0.005
respectively. Since 0.054 of the pair (帯, 電) in these
division probabilities is larger than the division
threshold of 0.050, division is carried out between "帯" and
"電". On the other hand, the division probabilities of
(携, 帯) and (電, 話) are 0.011 and 0.005 respectively.
Since these are smaller than the division threshold of
0.050, no division is carried out between these. As a
result, "携帯電話" is divided between "帯" and "電" into two
characteristic strings of "携帯" and "電話".

The detailed processing method in the prior art 3 has been
explained above. In this way, consideration is paid in the
prior art 3 not to search for a document or documents not
similar in contents to the seed document, by extracting
characteristic strings using 1-gram statistical information
so as not to extract an unsuitable characteristic string
across words.

However, the prior art 3 has a problem that, since the
system judges division or non-division on the basis of the
absolute value of the division probability, the extraction
accuracy of the characteristic string as a word is low. For
example, with respect to a single character type string of
"帯電", the system extracts a pair of 1-grams of (帯, 電)
and calculates 0.054 as a division probability between the
1-grams. Since this value is larger than the division
threshold of 0.050, division is erroneously carried out
between "帯" and "電" in "帯電" (which division will be
referred to as the erroneous division, hereinafter), with
the result that the system undesirably extracts two
unsuitable characteristic strings. This leads to a problem
that the system undesirably searches also for a document or
documents related to "帯 ('o-bi' in Japanese pronunciation)"
as a relevant document or documents.
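
The threshold-based division of the prior art 3 can be reproduced with a short Python sketch. It takes the division probabilities quoted above (0.011, 0.054 and 0.005) and the division threshold of 0.050 directly as inputs rather than deriving them, because the exact expression of FIG. 4 is not reproduced in this text. The function name `split_by_threshold` is hypothetical, and the example strings are an assumed reading of the Kanji strings for "portable phone" and "charging".

```python
def split_by_threshold(string, division_probs, threshold=0.050):
    """Divide `string` wherever a division probability exceeds the threshold.

    division_probs[i] is the division probability between string[i]
    and string[i + 1], so it must have len(string) - 1 entries.
    """
    if len(division_probs) != len(string) - 1:
        raise ValueError("need one probability per adjacent character pair")
    pieces, start = [], 0
    for i, probability in enumerate(division_probs):
        if probability > threshold:
            pieces.append(string[start:i + 1])
            start = i + 1
    pieces.append(string[start:])
    return pieces


# Intended division of the four-character string.
print(split_by_threshold("携帯電話", [0.011, 0.054, 0.005]))
# The same pair probability also splits the two-character word,
# which is the erroneous division described above.
print(split_by_threshold("帯電", [0.054]))
```

Because the decision depends only on the absolute value 0.054, the pair is divided in both strings regardless of the context in which it appears.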
As has been explained above, the word extracting method
using the word dictionary as in the prior art 1 has a problem
that, when a word not listed in the word dictionary indicates
the main concept of the seed document, the system unfavorably
searches for a document or documents shifted from the main
concept.

Further, the method for simply extracting n-grams from the
single character type string according to the character type
as in the prior art 2 has a problem that, since the system
undesirably extracts n-grams across words from a Kanji
character string that can form a compound word, the
system undesirably searches for a document or documents
not associated with the seed document as a relevant
document or documents.

Furthermore, the method for calculating the division
probability using the 1-gram statistical information and
judging division or non-division on the basis of the absolute
value of the calculated division probability as in the
prior art 3 also has a problem that, since the extraction
accuracy of the characteristic string as a word is low,
search noise is undesirably mixed in, thus resulting
in erroneous search of a document or documents shifted
from the target main concept as a relevant document or
documents.

SUMMARY OF THE INVENTION

In order to solve the above problems in the prior arts, it 
is therefore an object of the present invention to provide a 
method and system for extracting a characteristic string with 
less erroneous division. 

Another object of the present invention is to provide a 
method and system for extracting a characteristic string with 
less erroneous division and thus with less search noise to 
realize searching of a relevant document or documents with 
less shift from the main concept of a seed document.

In order to solve the above problems, the characteristic 
string extracting method in accordance with the present 
invention extracts a characteristic string from a seed docu- 
ment through operations of steps which follow. 

More specifically, the characteristic string extracting
method of the present invention includes steps of registering
a document and extracting a characteristic string from a seed
document,

wherein the document registration step further includes
steps of:

reading a document to be registered for document
registration (step 1);

dividing character strings in the registered document
read in the document reading step by character type
boundaries between Kanji and Katakana to extract
single character type strings (step 2);

with respect to each of the single character type strings
extracted in the above single character type string
extracting step, judging a character type thereof and,
when the type is determined to be Kanji or Katakana,
counting, with respect to n-grams of a predetermined
length in the registered document, an occurrence
frequency, a frequency of occurrence as a word head
(which will be referred to as the head-position
frequency, hereinafter), a frequency of occurrence as
a word tail (which will be referred to as the
tail-position frequency, hereinafter), and a frequency of
occurrence of the n-gram itself as a word (which will
be referred to as the independent frequency,
hereinafter) (step 3);

adding n-gram occurrence information counted by the
above occurrence information counting step to
occurrence information of the n-gram of the
documents already registered in a database to calculate
occurrence information on the entire database and
storing the calculated information in an associated
occurrence information file (step 4);

with respect to the n-gram whose occurrence
information was counted in the above occurrence
information counting step, acquiring occurrence
information of the entire database from the
associated occurrence information file to calculate a
probability thereof as a word head (which will be referred
to as the head-position probability, hereinafter), a
probability thereof as a word tail (which will be
referred to as the tail-position probability,
hereinafter), and a probability of occurrence as the
n-gram itself (which will be referred to as the
independent probability, hereinafter) and storing the
calculated probabilities in the associated occurrence
probability file (step 5);

extracting n-grams of a predetermined length from the
single character type string extracted in the above
single character type string extracting step to count
an occurrence frequency in the registered document
(step 6);

storing the occurrence frequency counted in the above
occurrence frequency counting step in an associated
occurrence frequency file (step 7); and

extracting a characteristic string from a seed document,
wherein the characteristic string extracting step further
includes steps of:

reading the seed document (step 8);

dividing a character string in the seed document read
in the above seed document reading step by
character type boundaries to extract single character
type strings (step 9); and

with respect to the single character type string
extracted in the searching single character type
string extracting step, judging a character type
thereof (step 10),

wherein, when the character type is of Kanji or
Katakana, the system reads the occurrence
probability file to acquire an independent probability of
a character string ranging from the head of the
single character type string to an i-th character, an
independent probability of a character string of the
head to (i+1)th characters, a head-position
probability of the (i+1)th character, and a head-position
probability of an (i+2)th character; calculates a
probability of division of the single character type
string at the i-th character (which will be referred
to as a division probability, hereinafter) as a
product of the independent probability of the character
string of the head to the i-th characters and the
head-position probability of the (i+1)th character;
calculates a division probability at the (i+1)th
character as a product of the independent
probability of the character string of the head to the
(i+1)th characters and the head-position
probability of the (i+2)th character; compares the division
probability of the i-th character with the division
probability of the (i+1)th character to set the
position having the larger division probability as a
single character type string division point (which
will be referred to as the division point,
hereinafter); when the character type is not of Kanji
or Katakana, extracts the single character type
string per se as the characteristic string; and
repeats similar operations over the remaining
character strings other than the extracted
characteristic string to extract another
characteristic string.
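
The comparison of division probabilities described in step 10 can be sketched in Python as follows. This is only an illustration under stated assumptions: the probability tables are hypothetical values rather than figures from this document, the starting position i=1, the tie rule (ties go to the later position), and the handling of a remainder too short to offer two candidate division points (kept whole) are all choices made for the sketch, and the function names are invented.

```python
def division_probability(s, i, independent, head):
    """Division probability after the i-th character of s:
    independent probability of s[:i] times the head-position
    probability of the (i+1)th character."""
    return independent.get(s[:i], 0.0) * head.get(s[i], 0.0)


def choose_division_point(s, i, independent, head):
    """Compare dividing after the i-th and the (i+1)th character and
    return the position with the larger division probability."""
    p_i = division_probability(s, i, independent, head)
    p_next = division_probability(s, i + 1, independent, head)
    return i if p_i > p_next else i + 1  # assumed tie rule


def extract_characteristic_strings(s, independent, head, i=1):
    """Repeatedly cut a Kanji/Katakana string at the chosen division point."""
    strings = []
    while len(s) >= i + 2:  # both candidate positions must exist
        cut = choose_division_point(s, i, independent, head)
        strings.append(s[:cut])
        s = s[cut:]
    if s:
        strings.append(s)  # assumption: a short remainder is kept whole
    return strings


# Hypothetical probabilities for illustration only.
independent = {"携": 0.01, "携帯": 0.10}
head = {"帯": 0.02, "電": 0.40}
print(extract_characteristic_strings("携帯電話", independent, head))
```

The point of the sketch is that the decision is relative: the two candidate positions are compared with each other, and no fixed division threshold is involved.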
In order to attain the above objects, further, a method for
searching for a relevant document or documents in
accordance with the present invention extracts a characteristic
string through the above steps to search for a document or
documents similar to a seed document using the extracted
characteristic string.

More specifically, the relevant document searching
method of the present invention includes steps of registering
a document and searching for a document or documents
similar to a seed document,
wherein the document registration step further includes
steps of:

reading a document to be registered for document
registration (step 1);

dividing character strings in the registered document
read in the document reading step by character type
boundaries between Kanji and Katakana to extract
single character type strings (step 2);

with respect to each of the single character type strings
extracted in the above single character type string
extracting step, judging a character type thereof and,
when the type is determined to be Kanji or Katakana,
counting, with respect to n-grams of a predetermined
length in the registered document, an occurrence
frequency, a frequency of occurrence as a word head
(which will be referred to as the head-position
frequency, hereinafter), a frequency of occurrence as
a word tail (which will be referred to as the
tail-position frequency, hereinafter), and a frequency of
occurrence of the n-gram itself as a word (which will
be referred to as the independent frequency,
hereinafter) (step 3);

adding n-gram occurrence information counted by the
above occurrence information counting step to
occurrence information of the n-gram of the
documents already registered in a database to calculate
occurrence information on the entire database and
storing the calculated information in an associated
occurrence information file (step 4);

with respect to the n-gram whose occurrence
information was counted in the above occurrence
information counting step, acquiring occurrence
information of the entire database from the associated
occurrence information file to calculate a probability
thereof as a word head (which will be referred to as
the head-position probability, hereinafter), a
probability thereof as a word tail (which will be referred
to as the tail-position probability, hereinafter), and a
probability of occurrence as the n-gram itself (which
will be referred to as the independent probability,
hereinafter) and storing the calculated probabilities
in the associated occurrence probability file (step 5);

extracting n-grams of a predetermined length from the
single character type string extracted in the above
single character type string extracting step to count
an occurrence frequency in the registered document
(step 6);

storing the occurrence frequency counted in the above
occurrence frequency counting step in an associated
occurrence frequency file (step 7); and

extracting a characteristic string from a seed document,
wherein the relevant document searching step further
includes steps of:

reading the seed document (step 8);

dividing a character string in the seed document read
in the above seed document reading step by
character type boundaries to extract single character
type strings (step 9);

with respect to the single character type string
extracted in the searching single character type
string extracting step, judging a character type
thereof (step 10),

wherein, when the character type is of Kanji or
Katakana, the system reads the occurrence
probability file to acquire an independent probability of
a character string ranging from the head of the
single character type string to an i-th character, an
independent probability of a character string of the
head to (i+1)th characters, a head-position
probability of the (i+1)th character, and a head-position
probability of an (i+2)th character; calculates a
probability of division of the single character type
string at the i-th character (which will be referred
to as a division probability, hereinafter) as a
product of the independent probability of the character
string of the head to the i-th characters and the
head-position probability of the (i+1)th character;
compares the division probability of the i-th
character with a division probability of the (i+1)th
character to set the position having the larger
division probability as a single character type
string division point (which will be referred to as
the division point, hereinafter); when the character
type is not of Kanji or Katakana, extracts the single
character type string per se as the characteristic
string; and repeats similar operations over the
remaining character strings other than the extracted
characteristic string to extract another
characteristic string,

counting occurrence frequencies of all characteristic
strings extracted in the above characteristic string
extracting step (step 11);

reading the occurrence frequency file for all the
characteristic strings extracted in the characteristic
string extracting step to acquire occurrence
frequencies of the characteristic strings in each
document in the database (step 12);

with respect to the characteristic strings extracted in
the above characteristic string extracting step,
calculating similarities between the seed document
and the documents in the database on the basis of
a predetermined computation expression with use
of their occurrence frequencies in the seed
document counted in the above within-seed-document
occurrence frequency counting step and the
occurrence frequencies in the documents
within the database acquired in the above
within-database occurrence frequency acquiring step
(step 13); and

outputting a searched result on the basis of the
similarities calculated in the above similarity
calculating step (step 14).
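
Steps 11 to 14 can be pictured with the following Python sketch. The text only refers to "a predetermined computation expression" for the similarity, so the expression used here (a sum of products of the seed-document frequency and the in-document frequency) is a stand-in assumption, not the expression of the invention. The occurrence frequency file is modeled as a dictionary mapping each string to (document identification number, frequency) pairs, and its contents, like the function name `rank_documents`, are hypothetical.

```python
from collections import Counter, defaultdict


def rank_documents(characteristic_strings, frequency_file):
    """Rank registered documents against a seed document.

    characteristic_strings: strings extracted from the seed document.
    frequency_file: string -> list of (document id, frequency) pairs.
    """
    seed_counts = Counter(characteristic_strings)          # step 11
    scores = defaultdict(float)
    for string, seed_freq in seed_counts.items():
        postings = frequency_file.get(string, [])          # step 12
        for doc_id, doc_freq in postings:
            # Step 13 with an assumed expression: frequency product.
            scores[doc_id] += seed_freq * doc_freq
    # Step 14: highest similarity first, ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical occurrence frequency file.
frequency_file = {
    "携帯": [(2, 1)],
    "電話": [(2, 1), (3, 1)],
    "帯電": [(4, 1)],
}
print(rank_documents(["携帯", "電話", "マナー"], frequency_file))
```

With these inputs, document 4 receives no score at all because "帯電" is never extracted from the seed document, which is the effect that reducing erroneous division is meant to achieve.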
The principle of the present invention based on the above 
document searching method will now be explained. 

In the present invention, the steps 1 to 7 are carried out for 
document registration. 
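
As a rough illustration of registration steps 2 to 5, the Python sketch below splits a text into Kanji and Katakana single character type strings, counts the four kinds of occurrence information for 1-grams and 2-grams, and converts them into probabilities by dividing by the occurrence frequency, in line with the 768/4,740=0.16 example given for the prior art 3. The Unicode ranges, the dictionary layout, the function names, and the choice to count an n-gram that spans a whole string as head, tail and independent at once are assumptions made for this sketch.

```python
import re
from collections import defaultdict

# Assumed character classes for Kanji and Katakana runs.
RUNS = re.compile("[\u4e00-\u9fff]+|[\u30a0-\u30ff]+")


def new_entry():
    return {"occ": 0, "head": 0, "tail": 0, "indep": 0}


def count_occurrence_info(text, lengths=(1, 2), info=None):
    """Steps 2 to 4: accumulate occurrence information per n-gram."""
    info = info if info is not None else defaultdict(new_entry)
    for run in RUNS.findall(text):                      # step 2
        for n in lengths:
            for i in range(len(run) - n + 1):           # step 3
                entry = info[run[i:i + n]]
                is_head = i == 0
                is_tail = i + n == len(run)
                entry["occ"] += 1
                entry["head"] += int(is_head)
                entry["tail"] += int(is_tail)
                entry["indep"] += int(is_head and is_tail)
    return info                                          # step 4 (accumulated)


def occurrence_probabilities(info):
    """Step 5: each frequency divided by the occurrence frequency."""
    return {
        gram: {key: e[key] / e["occ"] for key in ("head", "tail", "indep")}
        for gram, e in info.items()
    }


info = count_occurrence_info("携帯電話のマナー")
print(info["携"], info["携帯"])
```

Passing the returned `info` back in through the `info` parameter when the next document is registered accumulates the counts over the whole database, which is the role of the occurrence information file.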

First of all, in the document reading step 1, the system 
reads a document to be registered. In the next single char- 
acter type string extracting step 2, the system divides char- 
acter strings in the registration document read in the above 
document reading step 1 by character type boundaries of
Kanji or Katakana to extract character strings of a single
character type. For example, single character type strings
such as "電車内" and "携帯電話" are extracted from the above
document 2.

In the occurrence information counting step 3, next, the
system judges the character type of each of the single
character type strings extracted in the single character type
string extracting step 2, and if the character type is of Kanji
or Katakana, the system counts an occurrence frequency of
n-grams of a predetermined length n in the registration
document, a head-position frequency, a tail-position
frequency and an independent frequency thereof. For example,
assume that the system counts occurrence frequencies,
head-position frequencies and tail-position frequencies of
1-grams and 2-grams from Kanji and Katakana character
strings. Then with respect to the single character type
strings extracted in the single character type string
extracting step 2, the system counts 1 for the occurrence
frequency of "携", 1 for its head-position frequency, 0 for
its tail-position frequency and 0 for its independent
frequency, and counts 1 for the occurrence frequency of
"携帯", 1 for its head-position frequency, 0 for its
tail-position frequency and 0 for its independent frequency.

In the next occurrence information file creating/
registering step 4, the system adds occurrence information
of the n-grams extracted in the occurrence information
counting step 3 to occurrence information on the
documents already registered in the database and stores
the accumulated occurrence information in the
associated occurrence information file. FIG. 5 shows an
exemplary occurrence information file. The illustrated
occurrence information file is an example in which the
occurrence information extracted in the above occurrence
information counting step 3 is stored. The illustrated
occurrence information file shows information on an
occurrence frequency of 4,740, a head-position frequency of
768, a tail-position frequency of 492 and an independent
frequency of 42 for the 1-gram "携", and also information on
an occurrence frequency of 462, a head-position frequency of
419, a tail-position frequency of 52 and an independent
frequency of 48 for the 2-gram "携帯".

In the occurrence probability file creating/registering step
5, the system calculates occurrence probabilities of n-grams
whose occurrence information is stored in the occurrence
information file creating/registering step 4, and stores the
probabilities in the associated occurrence probability file.
With respect to the 1-gram "携", for example, as shown in
FIG. 5, the system counts 4,740 for its occurrence frequency,
768 for its head-position frequency, 492 for its tail-position
frequency and 42 for its independent frequency, and thus

In the next occurrence frequency counting step 6, the
system extracts n-grams of a predetermined length from all
single character type strings extracted in the single character
type string extracting step 2 and counts occurrence
frequencies thereof in the registration document. And in the
occurrence frequency file creating/registering step 7, the
system stores the occurrence frequencies of the n-grams
extracted in the above occurrence frequency counting step 6
in the corresponding occurrence frequency file.

FIG. 24 shows a procedure of operations of creating an
occurrence frequency file with use of the aforementioned
document 2 as an example.

First, in the single character type string extracting step 2,
the system extracts all single character type strings from the
document 2 as a registration document.

In the next occurrence frequency counting step 6, the
system extracts n-grams of a predetermined length from all
the single character type strings extracted in the above single
character type string extracting step 2, and counts
occurrence frequencies thereof in the registration document.
In the illustrated example, it is assumed that the system
extracts n-grams having lengths of 3 or less from the single
character type strings. In this case, the system extracts "電",
"車" and "内" having a length of 1, "電車" and "車内" having
a length of 2, and "電車内" having a length of 3 from
"電車内" included in single character type strings 2404; and
counts occurrence frequencies thereof in the document 2. As
a result, the system counts 2 for the occurrence frequency of
"電" in the document 2 and 1 for the occurrence frequency
of "電車" in the document 2.

In the occurrence frequency file creating/registering step
7, the system stores the occurrence frequencies of the
n-grams extracted in the occurrence frequency counting step
6 in the corresponding occurrence frequency file. As a result,
the system stores in the occurrence frequency file the
occurrence frequencies of the n-grams from the document 2
in combination with an identification number of the
registration document, in the form of (2,2) for 1-gram "電",
(2,1) for 1-gram "車", (2,1) for 1-gram "内", (2,1) for
2-gram "電車", (2,1) for 2-gram "車内" and (2,1) for 3-gram
"電車内". In this case, (2,1) means that the 2-gram "電車"
appears once in the document having an identification
number 2.

For searching operations, the system executes the steps 8
to 14.

First, in the seed document reading step 8, the system
reads the document 1 as a seed document. In the next
searching single character type string extracting step 9, the
system . . . the seed document

calculates 0.16 (-768/4,740) for its head-position (document 1) «*• "> the seed document reading step 8 by 

probability, 0.10 (=492/4,740) for its tail-position probabil- 55 character type boundaries to extract single character type 

ity and 0.01 (=42/4,740) for its independent probability. strings of smgle character types. 

FIG. 6 shows an exemplary occurrence probability file. The . ] n th < ; characteristic stnng extracting step 10 the system 

illustrated occurrence probability file shows an example judges the character type of each of the single character type 

when the occurrence probabilities extracted in the above strings extracted m the searchmg smgle character type string 

occurrence information counting step 3 are stored. That is, 60 extract mg step . Tr , 

the example shows information on a head-position probabil- " *? cha f ac,er ^ * ° ^ or ^™>. } he system 

ity of 0.16, a tail-position probability of 0.10 and an inde- reads . the aforementioned occurrence probability file and 

acquires an independent probability of a character stnng of 

pendent probability of 0.01 for the 1-gram and also from a head l0 i_ lh characters in the single character type 

information on an head-position probability of 0.90, an 65 string, an independent probability of from the head to (i+l)th 

tail-position probability of 0.11 and an independent prob- character, a head-position probability of the (i+l)th charac- 

ability of 0.10 for the 2-gram "fcflf • . ter and a head-position probability of the (i+2)th character. 
And the system calculates a division probability at the i-th
character as a product of the independent probability of the
character string from the head to the i-th character and the
head-position probability of the (i+1)th character; and calculates
a division probability at the (i+1)th character as a product of
the independent probability of the character string from the head
to the (i+1)th character and the head-position probability of the
(i+2)th character. And the system compares the division
probabilities of the i-th and (i+1)th characters, selects the
larger one of the probabilities as a division point, and extracts
the character string from the head to the division point as a
characteristic string.

If the character type is not Kanji or Katakana, then the system
uses the single character type string itself as a characteristic
string and repeats operations similar to the above to extract
another characteristic string.

FIG. 8 shows an example of how to extract characteristic strings
from the single character type string [illegible] extracted from
the document 1. The system first calculates a division probability
at the first character in [illegible] as a product of an
independent probability of 0.01 for [illegible] and a head-position
probability of 0.11 for [illegible], that is, 0.001 (=0.01×0.11).
Similarly, the system calculates a division probability at the
second character as a product of an independent probability of
0.10 for [illegible] and a head-position probability of 0.36 for
[illegible], that is, 0.036 (=0.10×0.36). The system then compares
these division probabilities and divides the single character type
string at the character having the larger probability. In this
case, since the division probability 0.036 of the second character
is larger than the other, the single character type string
[illegible] is divided into [illegible] and [illegible].

Also shown in FIG. 9 is an example of the single character type
string [illegible] which cannot be divided suitably in the prior
art 3, which will be explained in connection with dividing
operations of the present invention. First, the system of the
present invention calculates 0.0001 (=0.01×0.01) for a division
probability of the first character in [illegible] as a product of
an independent probability (0.01) of [illegible] and an independent
probability (0.01) of [illegible]. The system also calculates a
division probability at the second character, that is, an
occurrence probability of [illegible] as a single character type
string itself, as 0.10 for the independent probability of
[illegible]. The system compares these probabilities and divides
the single character type string at the character having the
larger one of the probabilities into single character type
strings. In this case, however, since the independent probability
0.10 of [illegible] is larger, [illegible] is divided at the second
character, which means that the single character type string
[illegible] is eventually not divided and is extracted as a group.

In this way, since comparison of the division probabilities for
the division of the single character type string enables word
division accurately reflecting actual occurrence circumstances in
the database, the present invention can reduce unsuitable division
more remarkably than the aforementioned prior art 3 for performing
the division based on the absolute values of the division
probabilities.

In the within-seed-document occurrence frequency counting step 11,
next, the system counts occurrence frequencies, in the seed
document, of the characteristic strings extracted in the above
characteristic string extracting step 10.

In the within-database occurrence frequency acquiring step 12,
with respect to the characteristic strings extracted in the
characteristic string extracting step 10, the system looks up the
above occurrence frequency file and acquires occurrence
frequencies in the documents within the database.

In the similarity calculating step 13, with regard to the
characteristic strings extracted in the characteristic string
extracting step 10, the system calculates similarities on the
basis of the occurrence frequencies of the characteristic strings
counted in the within-seed-document occurrence frequency counting
step 11 and the occurrence frequencies in the documents of the
database acquired in the within-database occurrence frequency
acquiring step 12.

For the calculation of the similarities, for example, such a
similarity computation expression (1) as disclosed in
JP-A-6-110948 and given below may be employed.

A similarity S(i) to document i is expressed as follows.

S(i) = [expression illegible]    (1)

Where [illegible] indicates a normalized weight for the j-th n-gram
in the seed document and is calculated from occurrence frequencies
of the n-grams in the seed document. R(j) indicates a normalized
weight of the j-th n-gram in a document in the database and is
calculated from occurrence frequencies of the n-grams of the
documents in the database. The "normalized weight" is an n-gram
occurrence bias in the database. This means that the larger the
value of the normalized weight is, the more biased toward a
specific document the appearance of the n-gram is. How to
calculate the normalized weight is explained in JP-A-6-110948 and
thus explanation thereof is omitted herein. n indicates the number
of all the documents in the database.

When the similarity S(i) for the document i is calculated using
the similarity expression (1) with the document 1 specified as the
seed document, it results in:

S(1)=1.0
S(2)=0.262
S(3)=0.048
S(4)=0.0

As a result, when the documents are arranged in descending order
of the similarities in the search result output step 14, the
documents 1, 2 and 3 are listed in this order. In this connection,
the document 4 is not output as a search result because it has a
similarity of 0.

As has been explained above, the similar document searching method
of the present invention based on the characteristic string
extracting method can mechanically extract character strings from
the single character type string without using any word dictionary
as in the prior art 1. Therefore the present invention can perform
searching operation without missing any word and thus can
accurately search for the concept of the seed document.

Further, unlike the prior art 2 for simply extracting n-grams from
a single character type string according to the character types,
the present invention extracts a group of meaningful n-grams on
the basis of statistical information and can realize more accurate
searching of the concept of the seed document.

Further, unlike the prior art 3 for performing the division based
on the absolute values of the division probabilities, the present
invention compares the division probabilities and performs the
division based on the larger probability.
Accordingly the present invention can realize word division
accurately reflecting actual occurrence circumstances in the
database and can remarkably reduce the possibility of unsuitable
word division. In this way, since the present invention can avoid
searching of unsuitable characteristic strings when compared with
the prior art methods, it can suitably search for the concept of
the seed document and can search for a relevant document or
documents at a high speed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an entire arrangement of a relevant document
searching system in accordance with a first embodiment according
to the present invention;

FIG. 2 shows an exemplary occurrence information file in a prior
art 3;

FIG. 3 shows an exemplary occurrence probability file in the prior
art 3;

FIG. 4 shows an example of a characteristic string extracting
method in the prior art 3;

FIG. 5 shows an exemplary occurrence information file in
accordance with the present invention;

FIG. 6 shows an exemplary occurrence probability file in
accordance with the present invention;

FIG. 7 shows an example of n-gram index in accordance with a third
embodiment of the present invention;

FIG. 8 shows a processing example when a program for comparison of
division probabilities and for extraction of a characteristic
string is applied to a Kanji character string in the first
embodiment of the present invention;

FIG. 9 shows an example of how to extract a characteristic string
in the present invention;

FIG. 10 is a problem analysis diagram (PAD) showing a procedure of
processing operations of a system control program 110 in the first
embodiment of the present invention;

FIG. 11 is a PAD showing a procedure of processing operations of a
document registration control program 111 in the first embodiment
of the present invention;

FIG. 12 is a PAD showing a procedure of processing operations of
an occurrence information file creation/registration program 121
in the first embodiment of the present invention;

FIG. 13 is a PAD showing a procedure of processing operations of a
search control program 112 in the first embodiment of the present
invention;

FIG. 14 is a PAD showing a procedure of processing operations of a
relevant document search program 131 in the first embodiment of
the present invention;

FIG. 15 shows an example of how to acquire an occurrence frequency
in the third embodiment of the present invention;

FIG. 16 is a PAD showing a procedure of processing operations of
an occurrence probability file creation/registration program 124
in the first embodiment of the present invention;

FIG. 17 is a PAD showing a procedure of processing operations of a
characteristic string extraction program 141 in the first
embodiment of the present invention;

FIG. 18 is a PAD showing a procedure of processing operations of a
program 142 (which will be referred to as the possibility
comparison/characteristic string extraction program 142,
hereinafter) for comparison of division probabilities and for
extraction of a characteristic string in the first embodiment of
the present invention;

FIG. 19 is a PAD showing a procedure of processing operations of a
division probability calculation program 143 in the first
embodiment of the present invention;

FIG. 20 shows a processing example when the possibility
comparison/characteristic string extraction program 142 is applied
to a Katakana character string in the first embodiment of the
present invention;

FIG. 21 is a PAD showing a procedure of processing operations of a
program for comparison of division probabilities and for
extraction of a characteristic string (which will be referred to
as the possibility comparison/characteristic string extraction
program 142a, hereinafter) in a second embodiment of the present
invention;

FIG. 22 is a PAD showing a procedure of processing operations of
the possibility comparison/characteristic string extraction
program 142 in the first embodiment of the present invention;

FIG. 23 is a PAD showing a procedure of processing operations of
the possibility comparison/characteristic string extraction
program 142a in the second embodiment of the present invention;

FIG. 24 shows a procedure of operations of creating an occurrence
frequency file in accordance with the present invention;

FIG. 25 is a PAD showing a procedure of processing operations of
an occurrence frequency file creation/registration program 127 in
the first embodiment of the present invention;

FIG. 26 is a PAD showing a procedure of processing operations of
an occurrence frequency acquirement program 146 in the first
embodiment of the present invention;

FIG. 27 shows an example of operations of executing the
characteristic string extraction program 141 in the first
embodiment of the present invention;

FIG. 28 shows an example of how to calculate division
probabilities in the first embodiment of the present invention;

FIG. 29 shows an arrangement of the relevant document search
program 131 in the third embodiment of the present invention;

FIG. 30 shows a procedure of operations of executing an occurrence
frequency acquirement program 146a in the third embodiment of the
present invention;

FIG. 31 shows a structure of a characteristic string extraction
program 141a in a fourth embodiment of the present invention;

FIG. 32 is a PAD showing a procedure of processing operations of
the characteristic string extraction program 141a in the fourth
embodiment of the present invention; and

FIG. 33 shows an example of executing the characteristic string
extraction program 141a in the fourth embodiment of the present
invention.

A first embodiment of the present invention will be detailed with
reference to FIG. 1.

The first embodiment in which the present invention is applied to
a relevant document searching system includes a display 100, a
keyboard 101, a central processing unit (CPU) 102, a magnetic disk
unit 105, a floppy disk drive (FDD) 103, a main memory 106 and a
bus 107 connected therebetween.

Stored in the magnetic disk unit 105 are a text 150, an occurrence
information file 151, an occurrence probability file 152 and an
occurrence frequency file 153. Information about registration
documents and a seed document stored in
a floppy disk 104 is read from the FDD 103 into a work area 
170 reserved in the main memory 106 or into the magnetic 
disk unit 105. 

Stored in the main memory 106 are a system control 
program 110, a document registration control program 111, 
a shared library 160, a text registration program 120, an 
occurrence information file creation/registration program 
121, an occurrence probability file creation/registration pro- 
gram 124, an occurrence frequency file creation/registration 
program 127, a search control program 112, a search con- 
ditional expression analysis program 130, a relevant docu- 
ment search program 131 and a searched result output 
program 132. Also reserved in the main memory 106 is the 
work area 170. 

These programs are stored in a portable storage medium 
such as the floppy disk 104 or a CD medium such as 
CD-ROM (not shown in FIG. 1). The programs are read out 
from such storage medium and installed into the magnetic 
disk unit 105. At the time of starting the relevant document 
searching system, the system control program 110 causes 
these programs to be read out from the magnetic disk unit 
105 and stored into the main memory 106. 

The shared library 160 is made up of a single character 
type string extraction program 161. 
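
The division at character type boundaries performed by the single
character type string extraction program 161 may be pictured with
the following minimal sketch. It is an illustration only: the
function names, the coarse Unicode ranges used to recognize each
character type and the sample text are assumptions and do not
appear in the specification.

```python
import itertools


def char_type(ch: str) -> str:
    """Classify one character by type (hypothetical helper, coarse Unicode ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block, including the long-vowel mark
        return "katakana"
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if ch.isalnum():
        return "alnum"
    return "other"


def single_type_strings(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing one character type."""
    return [
        (kind, "".join(run))
        for kind, run in itertools.groupby(text, key=char_type)
    ]


# Illustrative input (not taken from the specification).
for kind, run in single_type_strings("携帯電話のメール"):
    print(kind, run)
```

Only the Kanji and Katakana runs would then be subjected to the
division probability comparison; the other runs are used as they
are.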

The occurrence information file creation/registration pro- 
gram 121, which includes an occurrence information count 
program 122 and an occurrence information file creation 
program 123, is arranged to call the single character type 
string extraction program 161 from the shared library 160, 
which will be explained later. 

The occurrence probability file creation/registration pro- 
gram 124 includes an occurrence probability calculation 
program 125 and an occurrence probability file creation 
program 126. 
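
The calculation carried out by the occurrence probability
calculation program 125 may be sketched as below, using the
frequencies given for the 1-gram of FIG. 5. The class and function
names are hypothetical, and the handling of an n-gram with zero
occurrences is an assumption that the specification does not
address.

```python
from dataclasses import dataclass


@dataclass
class OccurrenceInfo:
    """Counts kept per n-gram in the occurrence information file (hypothetical layout)."""
    occurrence: int    # total occurrences of the n-gram
    head: int          # occurrences at the head of a single character type string
    tail: int          # occurrences at the tail of a single character type string
    independent: int   # occurrences as a whole single character type string


def occurrence_probabilities(info: OccurrenceInfo) -> tuple[float, float, float]:
    """Return (head-position, tail-position, independent) probabilities."""
    if info.occurrence == 0:
        # Assumption: an n-gram never seen gets zero probabilities.
        return (0.0, 0.0, 0.0)
    total = info.occurrence
    return (info.head / total, info.tail / total, info.independent / total)


# Frequencies of the 1-gram shown in FIG. 5.
head_p, tail_p, indep_p = occurrence_probabilities(
    OccurrenceInfo(occurrence=4740, head=768, tail=492, independent=42)
)
print(f"{head_p:.2f} {tail_p:.2f} {indep_p:.2f}")  # prints "0.16 0.10 0.01"
```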

The occurrence frequency file creation/registration pro- 
gram 127 includes an occurrence frequency count program 
128 and an occurrence frequency file creation program 129. 

The relevant document search program 131, which 
includes a seed document read program 140, a characteristic 
string extraction program 141, a within-seed-document 
occurrence frequency count program 145, an occurrence 
frequency acquirement program 146 and a similarity calcu- 
lation program 148, is configured to call the single character 
type string extraction program 161 from the shared library 
160, which will be explained later. 

The characteristic string extraction program 141 is used to 
call a program 142 for comparison of division probabilities 
and for extraction of a characteristic string (which will be 
referred to as the possibility comparison/characteristic string 
extraction program 142, hereinafter). The possibility 
comparison/characteristic string extraction program 142 is 
arranged to call a division probability calculation program 
143. The division probability calculation program 143 is 
arranged to call an occurrence probability file read program 
144. 
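
The cooperation of the programs 142 and 143 may be pictured with
the following minimal sketch, which compares the division
probabilities at two adjacent characters and cuts at the larger
one. The probabilities are those quoted for the FIG. 8 example;
the Latin letters are placeholders for the original characters,
and the function names, the dictionary-based probability tables
and the tie-breaking rule are assumptions.

```python
def division_probability(s: str, i: int,
                         independent: dict[str, float],
                         head: dict[str, float]) -> float:
    """Probability that s is divided after its first i characters.

    Product of the independent probability of s[:i] and the head-position
    probability of the next character. When no character follows, the
    independent probability alone is used.
    """
    p_independent = independent.get(s[:i], 0.0)
    if i >= len(s):
        return p_independent
    return p_independent * head.get(s[i], 0.0)


def extract_characteristic_string(s: str,
                                  independent: dict[str, float],
                                  head: dict[str, float],
                                  i: int = 1) -> tuple[str, str]:
    """Compare division at the i-th and (i+1)th characters and cut at the larger."""
    p_i = division_probability(s, i, independent, head)
    p_next = division_probability(s, i + 1, independent, head)
    cut = i if p_i >= p_next else i + 1   # assumption: a tie keeps the shorter string
    return s[:cut], s[cut:]


# Probabilities quoted for FIG. 8, with placeholder letters.
independent = {"A": 0.01, "AB": 0.10}
head = {"B": 0.11, "C": 0.36}
# 0.01 x 0.11 (given as 0.001 in the text) versus 0.10 x 0.36 = 0.036.
print(extract_characteristic_string("ABCD", independent, head))  # ('AB', 'CD')
```

Because the two candidate division points are compared with each
other rather than with a fixed threshold, the cut follows whichever
point is more probable in the registered documents.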

The occurrence frequency acquirement program 146 is 
provided to call an occurrence frequency file read program 
147. 

The document registration control program 111 and 
search control program 112 are activated, in response to 
user's instruction from the keyboard 101, under control of 
the system control program 110 to control the text registra- 
tion program 120, occurrence information file creation/ 
registration program 121, occurrence probability file 
creation/registration program 124 and occurrence frequency 
file creation/registration program 127 and to control the 
search conditional expression analysis program 130, rel- 
evant document search program 131 and searched result 
output program 132, respectively. 
Explanation will be made as to a procedure of operations 
of the relevant document searching system of the present 
embodiment. 

The processing procedure of the system control program 
110 will first be explained by referring to a problem analysis 
diagram (PAD) of FIG. 10. 

The system control program 110 first analyzes a command 
input from the keyboard 101 in a step 1000. 

When determining in a next step 1001 that the input 
command is one for registration execution based on its 
analyzed result, the program 110 starts the document regis- 
tration control program 111 in a step 1002 to register a 
document. 

When determining in a step 1003 that the input command 
is one for search execution, the program 110 starts in a step 
1004 the search control program 112 to search for a relevant 
document or documents. 

The processing procedure of the system control program 
110 has been explained above. 

Explanation will next be made as to a processing procedure 
of the document registration control program 111 to be 
activated by the system control program 110 in the step 1002 
of FIG. 10, with reference to a PAD of FIG. 11. 

The document registration control program 111 first activates 
the text registration program 120 in a step 1100 to read 
text data of a document to be registered from the floppy disk 
104 loaded in the FDD 103 into the work area 170 and then 
to load the data into the magnetic disk unit 105 as the text 
150. The text data may be input to the system not only by 
using the floppy disk 104 but also by using other means such 
as a communication line (not shown in FIG. 1) or a 
CD-ROM drive (not shown in FIG. 1). 

In a next step 1101, the document registration control 
program 111 starts the occurrence information file creation/ 
registration program 121 to read out the text 150 stored in 
the work area 170, to create the occurrence information file 
151 for n-grams therein and to store it into the magnetic disk 
unit 105. 

In a next step 1102, the document registration control 
program 111 starts the occurrence probability file creation/ 
registration program 124 to calculate occurrence probabilities 
of the n-grams in the text 150 stored in the work area 170 
and to store them into the magnetic disk unit 105 as the 
corresponding occurrence probability file 152. 

In a next step 1103, the document registration control 
program 111 starts the occurrence frequency file creation/ 
registration program 127 to read out the text 150 stored in 
the work area 170, to count occurrence frequencies of all the 
n-grams in each document and to store them into the 
magnetic disk unit 105 as the corresponding occurrence 
frequency file 153. 

The processing procedure of the document registration 
control program 111 has been explained above. 

Explanation will then be made as to a processing procedure 
of the occurrence information file creation/registration 
program 121 to be activated by the document registration 
control program 111 in the step 1101 of FIG. 11, by referring 
to a PAD of FIG. 12. 

The occurrence information file creation/registration program 
121 first starts the single character type string extraction 
program 161 in a step 1200 and divides character strings 
in the text 150 at character type boundaries to extract single
character type strings and to store them into the work area 170.

In a next step 1201, the program 121 starts the occurrence
information count program 122 to count an occurrence frequency of
an n-gram of a predetermined length in the text 150, a
head-position frequency thereof in the single character type
strings stored in the work area 170, a tail-position frequency
thereof and an independent frequency thereof, and to store them
into the work area 170.

In a next step 1202, the program 121 starts the occurrence
information file creation program 123 to add the occurrence
frequency, head-position frequency, tail-position frequency and
independent frequency of the n-gram in the text 150 stored in the
work area 170 to the occurrence frequency, head-position
frequency, tail-position frequency and independent frequency of
the corresponding n-gram stored in the occurrence information file
151, to store the result into the work area 170 and then to store
it into the magnetic disk unit 105 as the occurrence information
file 151.

The processing procedure of the occurrence information file
creation/registration program 121 has been explained above.

Next, explanation will be made with use of a PAD of FIG. 16 as to
a processing procedure of the occurrence probability file
creation/registration program 124 activated by the document
registration control program 111 in the step 1102 of FIG. 11.

The occurrence probability file creation/registration program 124
first starts the occurrence probability calculation program 125 in
a step 1600 to calculate an independent probability, head-position
probability and tail-position probability of each n-gram from the
occurrence information of each n-gram stored in the work area 170
and to store them into the work area 170.

In a next step 1601, the program 124 starts the occurrence
probability file creation program 126 to store the independent
probability, head-position probability and tail-position
probability of each n-gram stored in the work area 170 into the
magnetic disk unit 105 in the form of the occurrence probability
file 152.

The processing procedure of the occurrence probability file
creation/registration program 124 has been explained above.

Explanation will next be made as to a processing procedure of the
occurrence frequency file creation/registration program 127 to be
activated by the document registration control program 111 in the
step 1103 of FIG. 11, with reference to a PAD of FIG. 25.

The occurrence frequency file creation/registration program 127
first starts the occurrence frequency count program 128 in a step
2500 to extract n-grams having lengths ranging from 1 to m (the
length of the single character type string itself) from all the
single character type strings stored in the work area 170 in the
step 1200 of FIG. 12, to count occurrence frequencies of the
n-grams in the registration document and to store them into the
work area 170.

In a next step 2501, the program 127 starts the occurrence
frequency file creation program 129 to store the occurrence
frequencies of the n-grams counted in the step 2500 together with
an identification number (also referred to as a document number,
hereinafter) of the registration document into the magnetic disk
unit 105 as the occurrence frequency file 153.

A processing procedure of relevant document search based on the
search control program 112 to be activated by the system control
program 110 in the step 1004 of FIG. 10 will then be explained by
referring to a PAD of FIG. 13.

The search control program 112 first starts the search conditional
expression analysis program 130 in a step 1300 to analyze a search
conditional expression entered from the keyboard 101 and to
extract the document number of a specified seed document as a
parameter in the search conditional expression.

In a next step 1301, the program 112 starts the relevant document
search program 131 to calculate a similarity for each of the
documents in the text 150 stored in the magnetic disk unit 105
with respect to the seed document having the document number
extracted by the search conditional expression analysis program
130.

In a final step 1302, the program 112 starts the searched result
output program 132 to output a searched result on the basis of the
similarities of the documents calculated by the relevant document
search program 131.

The processing procedure of the document search based on the
search control program 112 has been explained above.

Explanation will next be made as to a processing procedure of the
relevant document search program 131 to be activated by the search
control program 112 in the step 1301 of FIG. 13, with reference to
a PAD of FIG. 14.

The relevant document search program 131 first starts the seed
document read program 140 in a step 1400 to read the seed document
of the document number extracted from the search conditional
expression by the search conditional expression analysis program
130 from the text 150 in the magnetic disk unit 105 to the work
area 170.

In this case, the reading of the seed document may be realized not
only by reading the document stored in the text 150 into the work
area 170 but also by directly inputting the seed document from the
keyboard 101 or by inputting it from other means such as the
floppy disk 104, a CD-ROM drive (not shown in FIG. 1) or a
communication line. Or the document reading may be realized by
inputting the seed document from a searched result of a full-text
search system or the like or by selecting the seed document from
the output of the searched result output program 132.

In a next step 1401, the relevant document search program 131
starts the single character type string extraction program 161 of
the shared library 160 to divide the text of the seed document
read by the seed document read program 140 at character type
boundaries into single character type strings and to store the
character strings in the work area 170.

In a step 1402, the program 131 starts the characteristic string
extraction program 141 (which will be explained later) to extract
a characteristic string from the single character type strings
acquired by the above single character type string extraction
program 161.

In a next step 1403, the program 131 starts the
within-seed-document occurrence frequency count program 145 to
count an occurrence frequency of the characteristic string
acquired by the characteristic string extraction program 141 in
the seed document.

In a next step 1404, the program 131 starts the occurrence
frequency acquirement program 146 to acquire occurrence
frequencies of the characteristic string acquired by the above
characteristic string extraction program 141 in the text 150 of
the documents.

In a final step 1405, the program 131 starts the similarity
calculation program 148 to calculate similarities between the seed
document and the documents in the text 150 on the basis of the
within-seed-document occurrence frequencies
acquired by the above within-seed-document occurrence
frequency count program 145 and the occurrence frequencies
of the documents in the text 150 acquired by the above
occurrence frequency acquirement program 146, with
respect to the characteristic strings acquired by the above
characteristic string extraction program 141.

For the similarity calculation, although the aforementioned
similarity calculation expression (1) has been used in
the present embodiment, other means may be employed.

When the above expression (1) is used for the similarity
calculation and the above document is specified as the seed
document, a similarity S(i) for a document i is calculated as
follows.

S(1)=1.0
S(2)=0.262
S(3)=0.048
S(4)=0.0

The processing procedure of the relevant document search
program 131 has been explained above.

Explanation will be made as to a processing procedure of
the characteristic string extraction program 141 activated by
the relevant document search program 131 in the step 1402
of FIG. 14, with reference to a PAD of FIG. 17.

In a step 1700, the characteristic string extraction program
141 acquires all the single character type strings stored in the
work area 170 by the single character type string extraction
program 161 in the step 1401 of FIG. 14.

In a step 1701, the program 141 repetitively executes
subsequent steps 1702 to 1704 with respect to all the single
character type strings acquired in the step 1700.

More specifically, in the step 1702, the program 141
judges character types of the single character type strings
acquired in the step 1700. When the character type is Kanji
or Katakana, the program 141 executes the step 1703;
whereas, when the character type is neither Kanji nor
Katakana, the program 141 executes the step 1704.

In the step 1703, the program 141 starts the possibility
comparison/characteristic string extraction program 142
(which will be explained later) to extract characteristic
strings from the character strings of a single Kanji or
Katakana character type.

In the step 1704, the program 141 extracts the single
character type strings themselves other than the single Kanji
or Katakana character type strings.

In a final step 1705, the program 141 stores the
characteristic strings extracted in the steps 1703 and 1704 in the
work area 170.

The processing procedure of the characteristic string
extraction program 141 has been explained above.

A processing procedure of the characteristic string
extraction program 141 shown in FIG. 14 will be explained in
connection with a specific example.

FIG. 27 shows an example of how to extract characteristic
strings from the above document 1.

The program 141 extracts the single character type strings
from the document 1.

Next the program 141 judges character types of the
single character type strings, and calls the possibility
comparison/characteristic string extraction program 142 to
extract the characteristic strings of the Kanji character strings
and the Katakana character string, and also to extract the
character strings other than the Kanji and Katakana character
strings themselves as characteristic strings.

The specific processing example of the characteristic
string extraction program 141 has been explained above.

Explanation will next be made as to a processing procedure
of the occurrence frequency acquirement program 146
to be activated by the relevant document search program 131
in the step 1404 of FIG. 14, with reference to a PAD
of FIG. 26.

The occurrence frequency acquirement program 146
acquires the characteristic strings stored in the work area
170 in the step 1402 of FIG. 14 (step 2600).

The program 146 executes a step 2602 with
respect to all the characteristic strings stored in the work area
170 (step 2601).

In the step 2602, the program 146 activates the occurrence
frequency file read program 147 to acquire occurrence
frequencies of the characteristic strings in the documents in
the text 150 and to store them in the work area 170.

The processing procedure of the occurrence frequency
acquirement program 146 has been explained above.

Explanation will then be made as to a processing procedure
of the possibility comparison/characteristic string
extraction program 142 to be activated by the characteristic
string extraction program 141 in the step 1703 of FIG. 17,
with reference to a PAD of FIG. 18.

In a step 1800, the possibility comparison/characteristic
string extraction program 142 sets 0 as an initial value of a
tail character position (to be referred to as the latest
division point LS, hereinafter) of the last-extracted
characteristic string.

The program 142 repetitively executes subsequent steps
1802 to 1809 when the length of the input single character
type string is a predetermined value or more (step 1801).

In the step 1802, the program 142 starts the division
probability calculation program 143 (which will be
explained later) to calculate a division probability P(i) of the
i-th character and a division probability P(i+1) of the (i+1)th
character when counted from the head of the single
character type string.

In the next step 1803, the program 142 compares the
division probabilities P(i) and P(i+1) calculated by the
division probability calculation program 143. When the
division probability P(i) is larger than the division
probability P(i+1), the program 142 executes the step 1804. When
the division probability P(i) is smaller than the division
probability P(i+1), the program 142 executes the step 1806.
When the division probability P(i) is equal to the division
probability P(i+1), the program 142 executes the step 1808.

In the step 1804, the program 142 extracts a character
string of the first to i-th characters when counted from the head
of the single character type string as a characteristic
string. And in the step 1805, the program 142 sets the
latest division point LS at i and adds 1 to the value i.

In the step 1806, the program 142 extracts a character
string of the head to (i+1)th characters in the single character
type string as a characteristic string. And in the step 1807,
the program 142 sets the latest division point LS at (i+1) and
adds 2 to the value i.

In the step 1808, the program 142 extracts a character
string of the head to (i+1)th characters when counted from the
head character of the single character type string as a
characteristic string. And in the step 1809, the program 142
updates the latest division point LS and adds 1 to the
value i.

The processing procedure of the possibility comparison/
characteristic string extraction program 142 has been
explained above.
A processing procedure of the possibility comparison/
characteristic string extraction program 142 shown in FIG. 
18 will be explained in connection with a specific example. 

FIG. 8 shows an example of how to extract characteristic
strings from a four-character single character type string
extracted from the above document 1.

A division probability P(1) of the first character in the
string is calculated to be 0.001 as a product of an
independent probability of 0.01 of the first character and a
head-position probability of 0.11 of the second character, and
a division probability P(2) of the second character is
calculated to be 0.036 as a product of an independent
probability of 0.10 of the first two characters and a
head-position probability of 0.36 of the third character. Next
these division probabilities are compared and the single
character type string is divided at the character with the larger
probability. In this case, since the division probability
P(2) (=0.036) of the second character is larger than
the division probability P(1) (=0.001) of the first character,
the single character type string is divided after the second
character into its first two characters and its last two
characters.

FIG. 20 shows an example of how to extract a
characteristic string from a three-character single character type
string extracted from the above document 1. First, a
division probability P(2) of the second character in the
string is calculated to be 0.00 as a product of an
independent probability of 0.00 of the first two characters and an
independent probability of 0.00 of the third character. Next a
division probability P(3) of the third character, that is, a
possibility that the three-character string appears as a single
character type string itself, is calculated to be 0.79 as a
product of a tail-position probability of 0.79 of the last two
characters and 1.0. These values are compared
and the single character type string is divided at the character
with the larger probability. In this case, since the division
probability P(3) (=0.79) of the third character is larger than the
division probability P(2) (=0.00) of the second character,
the single character type string is divided at the
third character, with the result that the single character type
string is not divided.

The specific processing procedure of the possibility 
comparison/characteristic string extraction program 142 has 
been explained above. 
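
The comparison used in the two examples above can be sketched in a few
lines of Python. This is only an illustrative sketch under stated
assumptions, not the program 142 itself: the function names are
hypothetical, the placeholder strings "ABCD" and "XYZ" stand in for the
original Japanese strings, and a tie is resolved toward the later
position purely as an assumption.

```python
from typing import Tuple


def larger_probability_position(i: int, p_i: float, p_next: float) -> int:
    """Return the 1-based position (i or i+1) whose division probability is larger.

    Assumption: when both probabilities are equal, the later position is used.
    """
    return i if p_i > p_next else i + 1


def split_at(s: str, position: int) -> Tuple[str, str]:
    """Split s after its position-th character (1-based)."""
    return s[:position], s[position:]


# Four-character example: P(1) = 0.001, P(2) = 0.036, so divide after character 2.
pos = larger_probability_position(1, 0.001, 0.036)
print(pos, split_at("ABCD", pos))

# Three-character example: P(2) = 0.00, P(3) = 0.79, so the string stays whole.
pos = larger_probability_position(2, 0.00, 0.79)
print(pos, split_at("XYZ", pos))
```

Dividing at the last character leaves an empty remainder, which
corresponds to the case where the string is not divided.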

Explanation will then be made as to a processing procedure
of the division probability calculation program 143 to
be activated by the possibility comparison/characteristic
string extraction program 142 in the step 1802 of FIG. 18,
with reference to a PAD of FIG. 19.

In a step 1900, the division probability calculation pro- 
gram 143 acquires a calculation position i and the latest 
division point LS specified in the step 1801 of FIG. 18. 

Next in order to calculate a division probability P(i) at the 
calculation position i, the program 143 executes steps 1901 
to 1906 to acquire each occurrence probability. 

In the step 1901, first, the program 143 compares a length 
n of the n-gram extracted in the step 1201 of FIG. 12 with 
the calculation position i of the division probability. When 
(i-LS) is not larger than n, the program 143 executes the step 
1902; whereas, when (i-LS) is larger than n, the program 143 
executes the step 1903. 

In the step 1902, the program 143 starts the occurrence
probability file read program 144 to acquire an independent
probability of the first to i-th characters from the latest
division point LS and to set it as an occurrence probability
Pre(i) of a character string located forward of the division
probability calculation position i.

In the step 1903, the program 143 starts the occurrence
probability file read program 144 to acquire a tail-position
probability of the last n-gram of the character string from the
latest division point LS to the i-th character and to set it as
an occurrence probability Pre(i) of a character string located
forward of the division probability calculation position i.

In the next step 1904, the program 143 compares the
length Ln of the single character type string with the division
probability calculation position i. When Ln is larger than
(i+1), the program 143 executes the step 1905; whereas,
when Ln is equal to (i+1), the program 143 executes the step
1906.

In the step 1905, the program 143 starts the occurrence
probability file read program 144 to acquire a head-position
probability of the (i+1)th 1-gram and to set it as an occurrence
probability Post(i) of a character string after the division
probability calculation position i.

In the step 1906, the program 143 starts the occurrence
probability file read program 144 to acquire an independent
probability of the (i+1)th 1-gram and to set it as an occurrence
probability Post(i) of a character string after the division
probability calculation position i.

In order to calculate a division probability P(i+1) at a
calculation position (i+1), the program 143 executes steps
1907 to 1913 to acquire occurrence probabilities.

In the step 1907, the program 143 compares the length n
of the n-gram extracted in the step 1201 of FIG. 12 with the
calculation position i of the division probability. When
((i+1)-LS) is not larger than n, the program 143 executes the
step 1908, while, when ((i+1)-LS) is larger than n, the
program 143 executes the step 1909.

In the step 1908, the program 143 starts the occurrence
probability file read program 144 to acquire an independent
probability of the character string from the character at the
latest division point LS to the (i+1)th character and to set it
as an occurrence probability Pre(i+1) of a character string
before the division probability calculation position (i+1).

In the step 1909, the program 143 starts the occurrence
probability file read program 144 to acquire a tail-position
probability of the last n-gram of the string from the latest
division point LS to the (i+1)th character and to set it as an
occurrence probability Pre(i+1) of a character string before the
division probability calculation position (i+1).

In the step 1910, the program 143 compares a length Ln
of the single character type string with the division probability
calculation position i. When Ln is larger than (i+2), the
program 143 executes the step 1911; when Ln is equal to
(i+2), the program 143 executes the step 1912; and when Ln
is equal to (i+1), the program 143 executes the step 1913.

In the step 1911, the program 143 starts the occurrence
probability file read program 144 to acquire a head-position
probability of the 1-gram at the (i+2)th character and to set it
as an occurrence probability Post(i+1) of a character string
after the division probability calculation position (i+1).

In the step 1912, the program 143 starts the occurrence
probability file read program 144 to acquire an independent
probability of the 1-gram at the (i+2)th character and to set it
as an occurrence probability Post(i+1) of a character string
after the division probability calculation position (i+1).

In the step 1913, the program 143 sets the occurrence
probability Post(i+1) of a string after the division probability
calculation position (i+1) to be equal to 1.
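
The acquisition of Pre and Post in the steps 1901 to 1913, together with
the product taken in the step 1914, can be sketched as one Python
function that works for either calculation position. This is a minimal
sketch under stated assumptions, not the program 143 itself: the `Probs`
record, the `table` dictionary standing in for the occurrence
probability file, the default of 0 for n-grams missing from the table,
and the placeholder strings "ABCD" and "XYZ" (used in place of the
original Japanese strings) are all hypothetical. The probability values
are those of the worked examples in the text.

```python
from typing import Dict, NamedTuple


class Probs(NamedTuple):
    """Occurrence probabilities of one n-gram (hypothetical record layout)."""
    independent: float = 0.0  # the n-gram appears as a whole word
    head: float = 0.0         # the n-gram appears at the head of a word
    tail: float = 0.0         # the n-gram appears at the tail of a word


def division_probability(s: str, pos: int, ls: int, n: int,
                         table: Dict[str, Probs]) -> float:
    """Return P(pos) for dividing s after its pos-th character (1-based).

    ls is the latest division point LS and n is the stored n-gram length.
    Assumption: n-grams missing from the table have probability 0.
    """
    zero = Probs()
    # Pre: the string between the latest division point and pos.
    if pos - ls <= n:
        pre = table.get(s[ls:pos], zero).independent
    else:
        pre = table.get(s[pos - n:pos], zero).tail
    # Post: what follows pos, depending on how many characters remain.
    remaining = len(s) - pos
    if remaining >= 2:
        post = table.get(s[pos], zero).head
    elif remaining == 1:
        post = table.get(s[pos], zero).independent
    else:
        post = 1.0
    return pre * post


# Four-character example with n = 2 and LS = 0.
table4 = {
    "A": Probs(independent=0.01),
    "B": Probs(head=0.11),
    "AB": Probs(independent=0.10),
    "C": Probs(head=0.36),
}
print(division_probability("ABCD", 1, 0, 2, table4))
print(division_probability("ABCD", 2, 0, 2, table4))
```

With these values P(2) is larger than P(1), so the string would be
divided after its second character, as in the example of the text.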

In the step 1914, the program 143 sets a product of the
occurrence probability Pre(i) acquired through the above
steps 1901 to 1903 and the occurrence probability Post(i)
acquired through the above steps 1904 to 1906 as the
division probability P(i) at the calculation position i;
whereas, the program 143 sets a product of the occurrence
probability Pre(i+1) acquired through the above steps 1907
to 1909 and the occurrence probability Post(i+1) acquired
through the above steps 1910 to 1913 as the division
probability P(i+1) at the calculation position (i+1).

The processing procedure of the division probability
calculation program 143 has been explained above.

Explanation will next be made as to a processing procedure
of the division probability calculation program 143
shown in FIG. 19 with use of a specific example.

FIG. 28 shows an example of how to calculate division
probabilities of the four-character single character type string
extracted from the above document 1. In the
illustrated example it is assumed that n-grams stored in the
occurrence probability file 152 have a length of 2 and the i-th
character for calculation of the division probability is the
first character. In other words, the following explanation will
be made to calculate a division probability P(1) at the first
character and a division probability P(2) at the second
character.

In order to confirm whether or not independent probabilities
of character strings up to the first character are already
stored in an occurrence probability file 600, the program 143
first compares a length of 2 of the n-gram stored in the
occurrence probability file 600 with a division probability
calculation position of 1. As a comparison result, the length
of the stored n-gram is longer, so that the program 143
acquires 0.01 of the independent probability of the character
string of up to the first character from the occurrence
probability file 600.

In order to confirm how many characters are present
backwards of the division probability calculation position,
the program 143 compares a length 4 of the single character
type string with the division probability calculation
position of 1. Since there is present a character string
of 2 or more characters, the program 143 acquires
the head-position probability of 0.11 of the second character
from the occurrence probability file 600. And the program 143
calculates a product of 0.01 of the independent probability of
the first character and 0.11 of the head-position probability of
the second character to acquire a division probability P(1)
(=0.001) at the first character.

Similarly, in order to confirm whether or not independent
probabilities of character strings of the first to second
character, which is the division probability calculation
position, are already stored in the occurrence probability file 600,
the program 143 compares a length 2 of the n-gram stored
in the file 600 with the division probability calculation
position of 2. Since the length of the stored n-gram is equal
to the calculation position, the program 143 acquires an
independent probability of 0.10 of the character string of the
first two characters from the occurrence probability file 600.

Next in order to confirm how many characters are present
backwards of the division probability calculation position,
the program 143 compares a length 4 of the single character
type string with the division probability calculation
position of 2. Since there is present a character string
of 2 characters, the program 143 acquires the
head-position probability of 0.36 of the third character from the
occurrence probability file 600. And the program 143 calculates
a product of 0.10 of the independent probability of the first
two characters and 0.36 of the head-position probability of the
third character to acquire a division
probability P(2) (=0.036) at the second character.

The specific processing procedure of the division
probability calculation program 143 has been explained above.

The first embodiment of the present invention has been
described above.

In the present embodiment, the processing procedure of
the division probability calculation program 143 has been
explained in connection with the case where the n-grams
stored in the occurrence information file 151 and occurrence
probability file 152 have a length of 2. However, it will be
appreciated that the length may be a fixed value of 1 or 3,
be a variable value based on information about occurrence
probability and so on in the database, be the length m of the
single character type string itself, or be a combination
thereof, which enables realization of the similar extracting
operation of the characteristic string.

Further, the processing procedure of the division
probability calculation program 143 has been explained to
search for a document or documents having contents similar
to those of the seed document in the present embodiment.
It will be appreciated that the seed document may be replaced
by a specified text to similarly extract characteristic strings
and to realize the relevant document searching operation.

In the present embodiment, the processing procedure of
the possibility comparison/characteristic string extraction
program 142 has been explained in connection with the
example where the division probability of a character string
of the head to n-th characters in the single character type
string is compared with the division probability of a
character string of the head to (n+1)th characters. However, it
will be appreciated that the similar extraction of characteristic
strings indicative of features in the document can be realized
even by comparing the division probability of a character
string of the tail to n-th characters backwards
in the single character type string with the division
probability of the tail to (n+1)th characters backwards or by
comparing the division probability of a character string of m
characters (m being an integer of 1 or more) in the single
character type string with the division probability of a
character string of n characters.

The present embodiment has been explained above as having
the arrangement including the possibility comparison/
characteristic string extraction program 142 for the Kanji or
Katakana single character type string. When the present
invention is to be used for a database not containing
Kanji or Katakana, however, the invention may be arranged
not to include the corresponding possibility comparison/
characteristic string extraction program 142, to include a
corresponding possibility comparison/characteristic string
extraction program 142 suitable for non-Kanji or
non-Katakana, or to include the characteristic string extraction
programs corresponding to the character types.

The present embodiment has been arranged to extract
characteristic strings from the single character type string.
However, the invention may be arranged to extract
characteristic strings from substrings spanning a specific boundary
between character types. In this case, for example, character
strings such as "F1" can be
searched and thus a more accurate relevant document search
can be realized.

Further, the occurrence information file creation/
registration program 121 has regarded the character type
boundary as the separation between words to count the head,
tail and independent frequencies of each n-gram in the
present embodiment. However, the program 121 may be
arranged to regard an adjunct such as Joshi (postpositional
word functioning as an auxiliary to a main word) or Jodoushi
(auxiliary verb) as a candidate of a break between words to
count the head, tail and independent frequencies of each
n-gram.

In the method of the present embodiment, the occurrence
information file 151 has been created in the form of such a
table as shown in FIG. 5. However, since increase in the
length of the objective n-gram causes increase in the number
of types of n-grams in the method, this requires a lot of time
in the processing of the occurrence probability file creation/
registration program 124. This problem can be solved by
adding a searching index to a characteristic string. This
results in that, even when the number of n-gram types is
increased, high-speed registering operation can be realized.
The searching index may be a full-text searching index 2901
or such a word index as disclosed in JP-A-8-329112. This
problem, which occurs even in the occurrence probability
file 152 and occurrence frequency file 153, can be eliminated
by adding a similar searching index.

The present embodiment has been arranged to start the
occurrence probability file creation/registration program 124
at the time of registering a document to create the occurrence
probability file 152. In this connection, however, when the
embodiment is arranged to calculate a corresponding
occurrence probability on the basis of the occurrence probabilities
of the n-grams stored in the occurrence information file 151
at the time of executing the possibility comparison/
characteristic string extraction program 142 for searching
operation of a relevant document or documents, the
number of files to be stored in the magnetic disk unit 105 can
be reduced.

In the present embodiment, the relevant document searching
system using the characteristic string extracted by the
characteristic string extraction program 141 has been
explained. However, the system may be used as a system for
extracting a characteristic string from a seed document, or
may be used in a system for extracting words contained in
a document based on morphological analysis and
automatically sorting documents using the extracted words, as
described in JP-A-8-153121.

The possibility comparison/characteristic string extraction
program 142 in the first embodiment compares the
division probability P(i) at the i-th character with the
division probability P(i+1) at the (i+1)th character and divides
the single character type string at the character with the
larger probability. For this reason the first embodiment has a
problem that the program 142 extracts characteristic strings of
(i+1) characters or less from all the single character type
strings and erroneously divides words longer than
(i+1) characters.

Explanation will be made as to an example in which the
above problem takes place, that is, words longer
than (i+1) characters are erroneously divided by the program
142 in the first embodiment, with use of a specific example
shown in FIG. 22. It is assumed in the illustrated example
that the single character type string is a three-character string
of a Kanji type and has an initial value of 1 at the division
probability calculation position i.

The possibility comparison/characteristic string extraction
program 142 first starts the above division probability
calculation program 143 in a step 2200 to calculate a
division probability P(1) for the first character and a division
probability P(2) for the second character. In the illustrated
example, the program 142 calculates a probability P(1) of
dividing the single character type string at the
first character into its first character and its last two
characters to be 0.000 as a product
of an independent probability of 0.03 for the 1-gram of the
first character and an independent probability of 0.00 for the
2-gram of the last two characters. Similarly,
the program 142 calculates a probability P(2) of dividing
the single character type string at the second
character into its first two characters and its last character
to be 0.004 as a product of an
independent probability of 0.03 for the 2-gram word of the
first two characters and an independent probability of 0.12
for the 1-gram word of the last character.

In a next step 2201, the program 142 determines the larger
one of the probabilities P(1) and P(2) calculated in the step
2200 as a division point and extracts a character string of the
head to division point characters in the single character type
string as a characteristic string. In the illustrated example,
since the probability P(2) is larger than the probability P(1),
the single character type string is divided at the
second character to extract a character string of
the first to second characters as a characteristic string.

In a next step 2202, the program 142 sets the position LS
(which will be referred to as a latest division point,
hereinafter) of a tail character in a last-extracted
characteristic string at 2, and continues to perform its characteristic
string extracting operation over the single character type
string subsequent to the latest division point.

In a next step 2203, the program 142 extracts the remaining
one-character single character type string as a characteristic
string, because the length 1 of the character string is less than a
predetermined length of 2. As a result, a document of " . . .
(a service area named "Michi's Eki" was built
along a national road) ..." is erroneously searched as a
relevant document.

The processing example of the possibility comparison/
characteristic string extraction program 142 in the first
embodiment has been explained above. In the illustrated
example, since the program 142 compares the division
probabilities P(1) and P(2) of the first and second characters
and uses the larger one of the probabilities as a division point,
the program extracts the first two characters and the last
character from the single character type string as
characteristic strings, which
undesirably results in that a document or documents shifted
from the central concept of the seed document are searched.

To avoid this, the second embodiment of the relevant
document searching system of the present invention is
arranged so that, only when the division probability
calculated at the time of extracting a characteristic string from a
single character type string is higher than a predetermined
value (which will be referred to as a division threshold,
hereinafter), the system performs its comparing operation to
extract a characteristic string longer in length than (i+1)
characters.

The present embodiment has substantially the same
arrangement as the first embodiment (FIG. 1), except that,
unlike the processing procedure of the possibility
comparison/characteristic string extraction program 142,
steps 2100 to 2104 are added as shown in a PAD of FIG. 21.

Explanation will be then made as to a processing procedure
of the possibility comparison/characteristic string
extraction program 142a in the second embodiment, by
referring to FIG. 21.

In a step 1800, the possibility comparison/characteristic
string extraction program 142a sets the initial value of the
latest division point LS at 0.

When the length of a single character type string for
extraction of a characteristic string therefrom is not less than
a predetermined value, the program 142a repetitively
executes steps 1802 to 1807 and 2101 to 2103 (step 2100).

In a step 1802, the program 142a starts the division
probability calculation program 143 to calculate a division
probability P(i) of the i-th character in the single character
type string when counted from its head character as well as
a division probability P(i+1) of the (i+1)th character.

In the next step 2100, the program 142a compares the
values of the division probabilities P(i) and P(i+1) calculated
by the above division probability calculation program 143
and the value of the predetermined division threshold Th to
extract the maximum one among these values. When the
program 142a extracts the division probability P(i) as a result of
the above comparison, the program 142a executes the step
1804; when the program 142a extracts the division
probability P(i+1), the program executes the step 1806; and when
the program 142a extracts the division threshold Th, the
program executes the step 2101.

In the step 1804, the program 142a extracts a character
string of the first to i-th characters in the single character
type string as a characteristic string. And in the step 1805,
the program 142a sets the latest division point LS at i and
adds 1 to the value of i.

In the step 1806, the program 142a extracts a character
string of the head to (i+1)th characters in the single character
type string as a characteristic string. And in the step 1807,
the program 142a sets the latest division point LS at (i+1)
and adds 2 to the value of i.

In the step 2101, the program 142a compares the division
probability calculation position i with the length Ln of the
single character type string. When (i+1) is smaller than the
character string length Ln, the program 142a executes the
step 2102; while, when (i+1) is not smaller than the
character string length Ln, the program 142a executes the step
2103.

In the step 2102, the program 142a adds 1 to the value of
the division probability calculation position i.

In the step 2103, the program 142a extracts the single
character type string itself as a characteristic string. And in
the step 2104, the program 142a sets the latest division point
LS to be equal to the character string length Ln and adds 1
to the value of i.

The processing procedure of the possibility comparison/
characteristic string extraction program 142a has been
explained above.

The processing procedure of the possibility comparison/
characteristic string extraction program 142a in the second
embodiment will be explained in connection with a specific
example of FIG. 23. In this example, it is assumed that a
character string of Kanji characters is used as a
single character type string, the division threshold Th has a

tion i is smaller than the length Ln, the program 142a adds
1 to the value of i.

In a step 2304, the program 142a calculates the division
probability P(2) of the second character and the division
probability P(3) of the third character in the single character
type string. In this example, the program 142a calculates a
possibility of dividing the string at the second character into
its first two characters and its last character
as a product P(2) (=0.004) of an independent
probability of 0.03 of the 2-gram word and an
independent probability of 0.12 of the 1-gram word;
whereas, the program calculates a possibility of occurrence
of the word of the head to third characters as a
product P(3) (=0.465) of a head-position probability of the
2-gram word of the first two characters and a tail-position
probability of the 2-gram of the last two characters.

In a next step 2305, the program 142a extracts the maximum
one of the division probabilities P(2) and P(3) calculated in
the above step 2304 and the division threshold Th. Since this
results in extraction of the maximum P(3), the program
extracts the character string of the head to third
characters as a characteristic string.

As has been explained in the foregoing, in accordance
with the present invention, only when the division probability
is higher than the division threshold, the comparing operation
is carried out, so that the division of the single character
type string at a position where division will not be done from
a language viewpoint can be avoided. For this reason, the
number of unsuitable characteristic strings extracted in the
first embodiment can be reduced to a large extent. Thus the
system can search for a document or documents similar
to the seed document at high speed.

Explanation will next be made as to a third embodiment
of the present invention, with reference to FIG. 29.

In the first and second embodiments, it is necessary to
previously store all possible character strings to be extracted
as characteristic strings in the occurrence frequency file 153.
This results in that, as the number of types in the character
strings increases, it takes a lot of time to acquire occurrence
frequencies of documents in the database, thus demanding
an increased capacity of magnetic disk.

The third embodiment of the relevant document searching
system of the present invention is arranged so that, in order
to acquire occurrence frequencies of documents in the
database with respect to characteristic strings extracted from
the seed document, not the occurrence frequency file 153 but
a full-text searching index is used to reduce the necessary
capacity of magnetic disk.

That is, in accordance with the present embodiment, a
full-text searching system is used to acquire occurrence
frequencies of documents in the database in the first

value of 0.050, and the division probability calculation embodiment, whereby the system can realize searching for 

position i has an initial value of 1. a relevant document or documents at high speed even when 

In a step 2200, the possibility comparison/characteristic me database contains lots of types of character strings, 

string extraction program 142a first starts the division prob- 55 F urme r, the occurrence frequency file 153 is replaced by the 

ability calculation program 143 to calculate the division foil-text searching index, so that, when the relevant docu- 

probability P(l) of the first character and the division ment seeing system is implemented in the form of a 

probability P(2) of the second character and to obtain combination with the full-text searching system, the capac- 

P(1)=0.000 and P(2)=0.004. ity of magnet j c disk in the present embodiment can be made 

In the step 2301, the program 142a extracts maximum one 60 sma n er triarj that in the first embodiment, 

of the division probabilities P(l) and P(2) calculated in the present embodiment has substantially the same 

step 220 and the division threshold Th. Since this results in arrangement as the first embodiment (FIG. 1), but different 

extraction of the maximum division threshold Th, the pro- therefrom in the occurrence frequency file read program 147 

gram 142a compares in a step 2302 the division probability forming the occurrence frequency acquirement program 146 

calculation position i(=l) with the length Ln(-3) of the 65 ^ the re i evant document search program 131. This program 

single character type string -ftSfca*. As a result of the is replaced by such a full-text search program 2902 as shown 

comparison, since the division probability calculation posi- in FIG. 29, 
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Of the processing procedures of the present embodiment, n-grams. Next tbe extracted n-grams and the occurrence 

a processing procedure of the occurrence frequency acquire- positions of the n-grams in the characteristic string are input 

ment program 146a will be explained by referring to FIG. to an index searcher 1501. In the index searcher 1501, 

30. indexes of the n-grams extracted from the characteristic 

A difference of this program from the occurrence fre- 5 strin g m read out from the n-gram index 700, ones of these 

quency acquirement program 146 (FIG. 26) is only an indexes which coincide with each other in document number 

occurrence frequency acquiring step 3000. The other pro- and have thc a" 06 positional relationship as the positional 

cessing steps in the processing procedure are the same as relationship in the characteristic string is extracted and 

explained in the first embodiment. 0Ut P ut 35 a searched result ' 

In the occurrence frequency acquiring step 3000, the 10 Irj tne case of tne example where -fcS- is input as the 

full-text search program 2902 searches for characteristic characteristic string, in the n-gram extractor 1500, (1-gram 

strings stored in the work area 170 to acquire occurrence 1-gram position "1") and (1 -gram *S% 1-gram position 

frequencies of tbe characteristic strings in documents in the "2") are extracted. In this case, the n-gram position "1" 

text 150. indicates the head of the query term and the n-gram position 

The full-text search program 2902 used in the occurrence 15 " 2 " indicates the position of the character next thereto, 

frequency acquiring step 3000 in the present embodiment Irj the index searcher 1501, next, indexes corresponding to 

may be of any type. For example, such an n-gram index type the 1-gram m %* and # S' are read out from the n-gram 

may be employed as disclosed in JP-A-64-35627 (which index 700. Ones of the indexes which" have an identical" 

will be referred to as the prior art 4, hereinafter). occurrence document number and have continual occur- 

The n-gram index system of the prior art 4, at the time of 20 rence positions such as n-gram position "1" and n-gram 

registering a document, extracts n-gram words from text position "2", that is, adjacent ones are extracted and output 

data of the database registration document as well as occur- as a searched result. 

rence positions of the n-gram words in the text and previ- Io this examp i e> smce (2, 28) of the 1-gram and (2, 

ously stores them in a magnetic disk unit 2900 as a full-text __ x , „ , . t , . 

searching index 2901, as shown in FIG. 29. At the time of 25 29 > °f the l;g ram / SS " . f have ***** ™ mber 

searching operation, the system extracte n-gram words and have ad J acenl P 051 ' 10115 of 28 and ™ ' " 18 ^ 

appearing in a specified query term, reads out corresponding that there is an n-gram '^S* as a character string and it is 

indexes from the full-text searching index 2901 in the detected that the query term appears in the document 

magnetic disk unit 29W, compares occurrence positions of 2 Ho siQC6 (3 n) of ^ ± ., g . ^ QOt 

the n-gram words in the indexes, judges whether or not a 30 

positional relationship of the n-gram extracted from the adjacent to (3, 15) of the 1-gram it will be seen that the 

query term is equal to a positional relationship of the n-gram characteristic string m BS* does not appear at this position, 

in the index, whereby the system can search for a document And the system obtains an occurrence frequency of the 

or documents in which the specified query terms appear. characteristic string by counting the occurrence position 

In this system, when characteristic strings are input to the 35 output as a searched result from the index searcher 1501. 

full-text search program 2902 as query terms to acquire As has been explained in the foregoing, in accordance 

documents in which the characteristic strings appear and with the present embodiment, when the characteristic string 

their positional information, occurrence frequencies of the searching index of the occurrence frequency file and the 

characteristic strings in the documents can be obtained. full-text searching index in place of the occurrence fre- 

A method for acquiring an occurrence frequency in the 40 quency file are used, high-speed relevant document search- 
prior art 4 will be detailed with reference to FIGS. 7 and 15. ing can be realized without causing increase of useless files. 
In this case, n in n-grams is assumed to have a value of 1. Explanation will then be made as to a fourth embodiment 

Explanation will first be made as to a processing proce- of the present invention with use of FIG. 31. 

dure in a document registration mode with use of FIG. 7. In the first, second and third embodiments, the division 

The system reads a text 701 for database registration into an 45 probability of a character string of the head to n-th characters 

n-gram index creating/registering step 702 to create an in the single character type string extracted from the seed 

n-gram index 700. The index 700 stores all 1 -grams appear- document has been compared with the division probability 

ing in the text 701 and occurrence positions of the 1-grams of a character string of the head to (n+l)th characters to 

in the text. extract a characteristic string. However, since this requires 

Since the 1-gram appears at the 26th character io a 50 holding of the occurrence information file 151 and occur- 

document having a document number of 2 in the text 701 in rence probability file 152, an increase in the number of types 

the illustrated text 701, the n-gram index 700 stores the of character strings will cause an increase in the necessary . 

1-gram and an occurrence position (2,26) associated ca Pf il £ of ™ a S° et ^ dis ^ 

therewith. That is, (2,26) indicated that this word appear at . ^ fourth embodiment of the relevant document search- 

the 26th character in the document having a document 55 m * s y stem of th ' mv f/* 0D '» toreducethe 

number of 2 necessary capacity or magnetic disk by using the occurrence 

""Explanation will next be made as to a processing proce- *** 153 mformatioD file 

dure in a search mode by referring to FIG. IS. In this case, 151 and o^unenoe^probahhty file 152 

. .„ , j . ° .. , The fourth embodiment or the present invention is sub- 
explanation will be made in connection with an example . r . ,. 

where an occurrence frequency of the characteristic string 60 *!^**. 1Q *™& m ™ 1 as the first embodiment 

(FIG. 1), but is different therefrom in the characteristic string 

extracted from the above document 1 of extraction program 141 which forms the relevant document 

" St^&OftJB (n^i—f 1 l«jai:fc*o . . . " is acquired search program 131 and which includes an n-gram extrac- 

from the above n-gram index 700. tion program 3100 and the aforementioned occurrence fre- 

A characteristic string to be searched is first input to an 65 quency acquirement program 146. 

n-gram extractor 1500 to extract all n-grams appearing in the Of the processing operations of the present embodiment, 

characteristic string as well as occurrence positions of the a processing procedure of the characteristic string extraction 
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program 141a different from that in the first embodiment 
will be explained by referring to FIG. 32. 

In a step 3200, the characteristic string extraction program 
141a first starts the single character type string extraction 
program 161 to acquire all single character type strings 
stored in the work area 170. 

In a next step 3201, the program 141a repetitively 
executes subsequent steps 3202 to 3205 with respect to all 
the single character type strings acquired in the above step 
3200. 

That is, in the step 3202, the program 141a starts the n-gram
extraction program 3100 to extract, from each single character
type string acquired in the step 3200, all n-grams of a
predetermined length n (n being an integer of 1 or more) while
shifting by one character at a time from the head character.

And in the step 3203, the program 141a repetitively 
executes the next step 3204 for all the n-grams extracted by 
the above n-gram extraction program 3100. That is, in the 
step 3204, the program 141a starts the occurrence frequency 
acquirement program 146 to acquire occurrence frequencies 
of the n-grams extracted by the n-gram extraction program 
3100. 

In the step 3205, the program 141a sorts the occurrence 
frequencies of the n-grams acquired in the step 3204 in a 
descending order and extracts a predetermined number of 
n-grams from the top as characteristic strings. 

The processing procedure of the characteristic string 
extraction program 141a has been explained above. 
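
The steps 3202 to 3205 can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation of the program 141a: the function names extract_ngrams and top_ngrams are hypothetical, the dictionary freq stands in for the occurrence frequency file 153, n-grams missing from it are treated as having a frequency of 0, and repeated n-grams are scored once.

```python
from typing import Dict, List, Tuple


def extract_ngrams(s: str, n: int) -> List[str]:
    """Step 3202 (sketch): slide a window of length n over s, one character at a time."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def top_ngrams(s: str, freq: Dict[str, int], n: int = 2, k: int = 2) -> List[Tuple[str, int]]:
    """Steps 3203 to 3205 (sketch): look up frequencies, sort descending, keep the top k."""
    # Drop repeated n-grams while keeping first-seen order (an assumption of this sketch).
    grams = list(dict.fromkeys(extract_ngrams(s, n)))
    # freq is a hypothetical stand-in for the occurrence frequency file 153.
    scored = [(g, freq.get(g, 0)) for g in grams]
    # sorted-in-place with a stable sort: ties keep their order of appearance.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For instance, with a placeholder string "ABCD" and the placeholder table {"AB": 5283, "BC": 462, "CD": 269}, top_ngrams("ABCD", freq) keeps "AB" and "BC". The Latin letters are stand-ins only; which 2-gram carries which frequency is an assumption of the example.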

The processing procedure of the characteristic string 
extraction program 141a shown in FIG. 32 will be explained 
in connection with a specific example. 

FIG. 33 shows an example of how to extract characteristic
strings from the aforementioned document 1 of
". . . […] . . .". It is
assumed in this example that n in n-gram has a value of 2
and two 2-grams are extracted from each single character
type string as featured n-grams.

The program 141a first extracts single character type
strings "[…]", "[…]", "[…]", "[…]" and ". . ." from the
document 1.

The program 141a then extracts all 2-grams by shifting
these single character type strings by every one character
from the head character therein, and sorts occurrence
frequencies of the 2-grams in a descending order. For example,
the program 141a extracts three 2-grams from the single
character type string "[…]" and acquires occurrence
frequencies thereof in the database. As a result, the program
141a acquires ("[…]", 5,283), ("[…]", 462) and ("[…]", 269).
In this case, ("[…]", 5,283) indicates that an occurrence
frequency of the 2-gram "[…]" in the database is 5,283.

Next the program 141a extracts upper two of the 2-grams
in each single character type string as featured n-grams.
Since ("[…]", 5,283) and ("[…]", 462) are upper two for
the single character type string "[…]", the program 141a
extracts the two corresponding 2-grams as characteristic
strings.

The specific processing example of the characteristic 
string extraction program 141a has been explained above. 

As has been explained in the foregoing, in accordance 
with the present embodiment, since the occurrence infor- 
mation file 151 and occurrence probability file 152 are not 
used and the occurrence frequency file 153 is instead used, 
characteristic strings accurately reflecting actual occurrence 
circumstances in the database can be extracted. 
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In the present embodiment, the processing procedure of 
the n-gram extraction program 3100 has been explained in 
connection with the case where all n-grams having a pre- 
determined length of n are extracted while shifting each 
single character type string by every one character from the 
head character. However, any number of n-grams in the 
single character type string may be extracted, or m-grams (m 
being an integer of 1 or more) in the single character type 
string may be extracted. Further, the length n of n-grams to 
be extracted has been predetermined. However, the value of 
n may be changed according to the length of the single 
character type string or according to the type of the single 
character type string. Furthermore, since the n-gram extract- 
ing technique of the present invention can extract n-grams 
indicative of features of a document, this technique can be 
applied also to calculation of a vector indicative of features 
of a document using n-grams or to sorting of documents 
using n-grams. 
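
One way to picture the vector application mentioned above is the following sketch. It is an illustration under assumptions, not a method disclosed here: the function names are hypothetical, the vector is a plain count of n-grams, and cosine similarity is used only as a familiar stand-in, since no particular expression is given for this application.

```python
import math
from collections import Counter


def ngram_vector(text: str, n: int = 2) -> Counter:
    """Count every n-gram of length n in text; the counts act as a sparse feature vector."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Documents could then be sorted by comparing their vectors with that of a reference document; whether raw counts or weighted counts work better is left open in this sketch.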

In accordance with the present invention, characteristic
strings can be extracted while lessening erroneous division. 
As a result, even when the system performs its relevant 
document searching operation without looking up the word 
dictionary, the system can search with use of meaningful 
character strings, thus realizing searching of a relevant 
document or documents less shifted from the main concept. 
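
The idea of choosing a division point by comparing division probabilities can be sketched as follows. This is a simplified single-split variant written for illustration, not the stepwise flow of the flowcharts. It assumes that the division probability at a position is the product of a word-tail probability and a word-head probability (the formulation recited in claim 19), that a split is accepted only above a threshold (0.050 is the value used in the second embodiment's example), and that unknown n-grams have probability 0.0. The tables tail_prob and head_prob and both function names are hypothetical.

```python
from typing import Dict, List


def division_probability(s: str, pos: int,
                         tail_prob: Dict[str, float],
                         head_prob: Dict[str, float],
                         n: int = 1, m: int = 1) -> float:
    """Probability that s is divided between s[pos - 1] and s[pos].

    Product of the probability that the n-gram ending at pos is a word tail
    and the probability that the m-gram starting at pos is a word head.
    N-grams missing from a table count as 0.0 (an assumption of this sketch).
    """
    tail_gram = s[max(0, pos - n):pos]
    head_gram = s[pos:pos + m]
    return tail_prob.get(tail_gram, 0.0) * head_prob.get(head_gram, 0.0)


def split_at_best_point(s: str,
                        tail_prob: Dict[str, float],
                        head_prob: Dict[str, float],
                        threshold: float = 0.050) -> List[str]:
    """Split s once at the position with the highest division probability.

    The string is returned undivided when it is too short or when no
    position exceeds the threshold.
    """
    if len(s) < 2:
        return [s]
    probs = {pos: division_probability(s, pos, tail_prob, head_prob)
             for pos in range(1, len(s))}
    best = max(probs, key=probs.get)
    if probs[best] <= threshold:
        return [s]
    return [s[:best], s[best:]]
```

Raising the threshold makes the sketch more conservative: a string whose best position does not clear the threshold is kept whole as a single characteristic string.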

What is claimed is: 

1. A method for extracting words contained in document 
data specified by a user comprising the steps of: 

extracting a substring from said document data and look- 
ing up word boundary probability information at a head 
or tail of a previously-prepared partial character string 
to calculate division probabilities at at least two character
positions,

wherein said word boundary probability information is 
used to determine a likely position a compound word 
should be divided; and 

comparing the division probabilities calculated in said 
extracting step at at least two or more character posi- 
tions to determine a division point in the word in the 
specified text. 

2. A method for extracting words contained in document 
data specified by a user comprising the steps of: 

extracting a substring from said document data and look- 
ing up word boundary probability information at a head 
or tail of a previously-prepared partial character string 
to calculate division probabilities at at least two char- 
acter positions; and 

comparing the division probabilities calculated in said 
extracting step at at least two or more character posi- 
tions to determine a division point in the word in the 
specified text, 

wherein said comparing step is replaced by a step of 
comparing the division probabilities calculated in said 
extracting step at at least two or more character posi- 
tions to determine a division probability in a word in 
the specified text. 

3. A method for extracting words contained in document 
data specified by a user comprising the steps of: 

extracting a substring from said document data and look- 
ing up word boundary probability information at a head 
or tail of a previously-prepared partial character string 
to calculate division probabilities at at least two char- 
acter positions; and 

comparing the division probabilities calculated in said 
extracting step at at least two or more character posi- 
tions to determine a division point in the word in the 
specified text, 
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wherein said extracting step is replaced by a step of 
extracting a substring from said specified text and 
looking up a probability that said substring is adjacent 
to a boundary of a predetermined character type at a 
head or tail of a previously-prepared substring to calculate
a division probability of the character position.

4. A relevant document searching method comprising the 
steps of: 

extracting one or more words from a text, which will be 
referred to as specified text, of a sentence or document, 
which will be referred to collectively as a document, 
specified by a user in a text database storing character 
data as code data therein; 

counting occurrence frequencies of the words extracted in 
said word extracting step in the specified text; 

acquiring occurrence frequencies of the words extracted 
in said word extracting step in document texts, which
will be referred to as registration texts, stored in said
text database;

calculating similarities of the registration texts to the 
specified text in accordance with a predetermined cal- 
culation expression with use of the occurrence frequen- 
cies counted in said occurrence frequency counting step 
as well as the occurrence frequencies acquired in said 
occurrence frequency acquiring step; and 
outputting the similarities of the registration texts to the 
specified text calculated in said similarity calculating 
step as a searched result, 
wherein said word extracting step comprises the steps of: 
extracting a substring from said specified text and 
looking up word boundary probability information at 
a head or tail of a previously-prepared partial character
string to calculate division probabilities at
at least two character positions, and 
comparing the division probabilities calculated in said 
extracting a substring step at at least two or more 
character positions to determine a division point in 
the word in the specified text. 

5. A relevant document searching method as set forth in 
claim 4, wherein said comparing the division probabilities 
step is replaced by a step of comparing the division prob- 
abilities calculated in said division probability calculating 
step at at least two character positions to determine a 
division probability in a word in the specified text. 

6. A relevant document searching method as set forth in 
claim 4, wherein said extracting a substring step is replaced 
by a step of extracting a substring from said specified text 
and looking up a probability that said substring is adjacent 
to a boundary of a predetermined character type at a head or 
tail of a previously-prepared substring to calculate a division 
probability of a character position. 

7. A relevant document searching method as set forth in 
claim 6, further comprising the steps of: 

extracting a substring at a boundary of a predetermined
character type in the registration text; 

calculating a possibility that said substring is adjacent to 
the character type boundary at a head or tail thereof and 
storing the possibility in a corresponding character type 
boundary probability file to register a document in a 
text database; and 

looking up said character type boundary probability file to 
acquire a possibility that the substring is adjacent to a 
boundary of a predetermined character type at the 
character position. 

8. A system for extracting a word from a sentence or 
document, which will be referred to collectively as a 
document, specified by a user, comprising: 
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means for extracting a substring from a text, which will be 
referred to as specified text, of the document specified 
by the user and looking up word boundary possibility 
information at a head or tail of a previously-prepared 
substring with respect to the extracted substring to
calculate division probabilities at at least two character 
positions, 

wherein said word boundary possibility information is 
used to determine a likely position a compound word 
should be divided; and

means for comparing the division probabilities at at least 
two character positions calculated in said means for 
extracting a substring to determine a division point of 
a word in the specified text. 
9. A system for extracting a word from a sentence or
document, which will be referred to collectively as a 
document, specified by a user, comprising: 

means for extracting a substring from a text, which will be 
referred to as specified text, of the document specified
by the user and looking up word boundary possibility 
information at a head or tail of a previously-prepared 
substring with respect to the extracted substring to 
calculate a division probability at a character position; 

means for comparing the division probabilities at at least 
two character positions calculated in said means for 
extracting a substring to determine a division point of 
a word in the specified text, 
wherein said means for comparing is replaced by means
for comparing the division probabilities calculated in 
said means for extracting a substring at at least two or 
more character positions to determine a division prob- 
ability in a word in the specified text. 
10. A system for extracting a word from a sentence or
document, which will be referred to collectively as a 
document, specified by a user, comprising: 

means for extracting a substring from a text, which will be 
referred to as specified text, of the document specified 
by the user and looking up word boundary possibility
information at a head or tail of a previously-prepared 
substring with respect to the extracted substring to 
calculate division probabilities at at least two character 
positions; and 

means for comparing the division probabilities at at least
two character positions calculated in said means for 
extracting a substring to determine a division point of 
a word in the specified text, 

wherein said means for extracting a substring is replaced
by means for extracting a substring from said specified 
text and looking up a probability that said substring is 
adjacent to a boundary of a predetermined character 
type at a head or tail of a previously-prepared substring
to calculate a division probability of the character
position. 

11. A relevant document searching system comprising: 
means for extracting one or more words from a text, 
which will be referred to as a specified text, of a 
sentence or document, which will be referred to collectively
as a document, specified by a user in a text
database storing character data as code data therein; 
means for counting occurrence frequencies of the words 
extracted in said word extracting means in the specified 
text;

means for acquiring occurrence frequencies of the words 
extracted in said word extracting means in document 
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texts, which will be referred to as registration texts, 
stored in said text database; 
means for calculating similarities of the registration texts 
to the specified text in accordance with a predetermined 
calculation expression with use of occurrence frequen- 
cies counted in said occurrence frequency counting 
means as well as occurrence frequencies acquired in 
said occurrence frequency acquiring means; and 
means for outputting the similarities of the registration 
texts to the specified text calculated in said similarity 
calculating means as a searched result, 
wherein said word extracting means comprises: 

means for extracting a substring from said specified 
text and looking up word boundary probability infor- 
mation at a head or tail of a previously-prepared 
partial character string to calculate division prob- 
abilities at at least two character positions, and 
means for comparing the division probabilities calcu- 
lated in said means for extracting a substring at at 
least two character positions to determine a division 
point in the word in the specified text. 

12. A relevant document searching system as set forth in 
claim 11, wherein said word division point judging means is 
replaced by means for comparing the division probabilities 
calculated in said division probability calculating means at 
at least two or more character positions to determine a 
division probability in a word in the specified text. 

13. A relevant document searching system as set forth in 
claim 11, wherein said means for extracting a substring is 
replaced by means for extracting a substring from said 
specified text and looking up a probability that said substring 
is adjacent to a boundary of a predetermined character type 
at a head or tail of a previously-prepared substring to 
calculate a division probability of the character position. 

14. A relevant document searching system as set forth in 
claim 13, further comprising: 

means for extracting a substring at a boundary of a 
predetermined character type in the registration text, 
calculating a possibility that said substring is adjacent 
to the character type boundary at a head or tail thereof, 
and storing the possibility in a corresponding character 
type boundary probability file to register a document in 
a text database; and 

means for looking up said character type boundary prob- 
ability file to acquire a possibility that the substring is 
adjacent to a boundary of a predetermined character 
type at the character position. 

15. A storage medium for storing a program, executable 
by a computer, for extraction of a word from a sentence or 
document, which will be referred to collectively as a 
document, specified by a user, said program when executed 
causes said computer to perform the steps of: 

extracting a substring from a text, which will be referred
to as specified text, of the document specified by the 
user and looking up word boundary possibility infor- 
mation at a head or tail of a previously-prepared 
substring with respect to the extracted substring to 
calculate division probabilities at at least two character 
positions, 

wherein said word boundary possibility information is 
used to determine a likely position a compound word 
should be divided; and 

comparing the division probabilities at at least two char- 
acter positions calculated in said division probability 
calculating step to determine a division point of a word 
in the specified text. 
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16. A storage medium for storing a relevant document 
search program, said program when executed by a computer 
causes the computer to perform the steps of: 

extracting one or more words from a text, which will be 
referred to as specified text, of a sentence or document,
which will be referred to collectively as a document, 
specified by a user in a text database storing character
data as code data therein; 
counting occurrence frequencies of the words extracted in
said word extracting step in the specified text;
acquiring occurrence frequencies of the words extracted 
in said word extracting step in document texts, which 
will be referred to as registration texts, stored in said 
text database; 

calculating similarities of the registration texts to the 
specified text in accordance with a predetermined calculation
expression with use of occurrence frequencies
counted in said occurrence frequency counting step as
well as occurrence frequencies acquired in said occurrence
frequency acquiring step; and
outputting the similarities of the registration texts to the 
specified text calculated in said similarity calculating 
step as a searched result, 
wherein said word extracting step comprises the steps of:
extracting a substring from said specified text and 
looking up word boundary probability information at 
a head or tail of a previously-prepared partial char- 
acter string to calculate division probabilities at at 
least two character positions, and

comparing the division probabilities calculated in said 
extracting a substring step at at least two character 
positions to determine a division point in the word in 
the specified text. 
17. A word extracting method as set forth in claim 1,
wherein a probability that a substring having a predeter- 
mined length starting or ending at a specific character 
position appears adjacent to a character set boundary is used 
as said word boundary probability information. 
18. A method for extracting a characteristic string from a
document including a text, comprising the steps of: 

extracting a candidate string which is a candidate of a 
word starting or ending at an inter-word boundary from 
the text; 

calculating a division probability that the extracted candidate
string is divided at a position in the candidate
string and repeating the calculation with respect to a 
plurality of positions; 
comparing the division probabilities with one another and 
dividing the candidate string into substrings at the
position having the high division probability determined
by said comparing step; and
extracting at least one of the substrings as the characteristic
string.

19. A characteristic string extracting method according to 
claim 18, wherein said division probability calculating step 
further comprises the steps of: 

obtaining a first probability that an n-gram (n is an integer
equal to or larger than 1) ending at the position is a
word tail or an independent word;
obtaining a second probability that an m-gram (m is an
integer equal to or larger than 1) starting at the position 
is a word head or an independent word; 
multiplying the first probability by the second probability
to obtain a product and determining the product as the 
division probability at the position. 
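As an illustration only, the calculation recited in claims 18 and 19 can be sketched in Python as follows. The tables `tail_prob` and `head_prob`, the function names, and the choice to split once at the single highest-scoring position are assumptions made for this sketch; the claims do not prescribe any data layout or splitting policy.

```python
def division_probability(text, pos, tail_prob, head_prob, n=1, m=1):
    """Product of two looked-up probabilities at character position `pos`.

    tail_prob: hypothetical table, n-gram -> probability that it is a word
               tail or an independent word.
    head_prob: hypothetical table, m-gram -> probability that it is a word
               head or an independent word.
    """
    left = text[max(0, pos - n):pos]   # n-gram ending at the position
    right = text[pos:pos + m]          # m-gram starting at the position
    return tail_prob.get(left, 0.0) * head_prob.get(right, 0.0)


def split_candidate(text, tail_prob, head_prob, n=1, m=1):
    """Divide a candidate string at the position with the largest product."""
    if len(text) < 2:
        return [text]  # no interior position to divide at
    best = max(
        range(1, len(text)),
        key=lambda p: division_probability(text, p, tail_prob, head_prob, n, m),
    )
    return [text[:best], text[best:]]
```

With made-up tables such as `tail_prob = {"a": 0.1, "b": 0.8, "c": 0.2}` and `head_prob = {"b": 0.1, "c": 0.9, "d": 0.3}`, the candidate `"abcd"` scores highest between `"b"` and `"c"`, so the sketch divides it into `"ab"` and `"cd"`.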
20. A method for searching a text database storing documents
for a document relevant to a user specified document,
comprising the steps of: 

detecting an inter-word boundary from a text in the user
specified document and extracting a candidate string
which is a candidate of a word starting or ending at the
inter-word boundary from the text;

calculating a division probability that the extracted candidate
string is divided at a position in the candidate
string and repeating the calculating with respect to a
plurality of positions;

comparing the division probabilities with one another and
dividing the candidate string into substrings at the
position having the high division probability determined
by the comparing step;

extracting at least one of the substrings as the characteristic
string;

counting occurrence frequency of the extracted characteristic
string in the text;

obtaining occurrence frequency of the extracted charac- 
teristic string in each document in the text database; 

calculating a similarity between the text and each document
using the characteristic string occurrence frequency
in the text and the characteristic string occurrence
frequency in each document; and

outputting the similarity as a search result. 
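The counting and similarity steps of claim 20 can be illustrated with a short Python sketch. Claim 20 does not name a similarity measure; cosine similarity over the frequency vectors is used here purely as an assumption, and `count_occurrences`, `cosine_similarity`, and `rank_documents` are hypothetical helper names.

```python
import math


def count_occurrences(text, strings):
    """Occurrence frequency of each characteristic string in the text
    (non-overlapping matches, as counted by str.count)."""
    return {s: text.count(s) for s in strings}


def cosine_similarity(query_freq, doc_freq):
    """Cosine of the angle between two frequency vectors.
    Assumed measure: the claim only requires 'a similarity'."""
    dot = sum(f * doc_freq.get(s, 0) for s, f in query_freq.items())
    q_norm = math.sqrt(sum(f * f for f in query_freq.values()))
    d_norm = math.sqrt(sum(f * f for f in doc_freq.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


def rank_documents(query_freq, freq_by_doc):
    """Return (document id, similarity) pairs, most similar first."""
    scores = {
        doc_id: cosine_similarity(query_freq, freq)
        for doc_id, freq in freq_by_doc.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Here `freq_by_doc` stands in for whatever structure holds the per-document frequencies of the characteristic strings; any other measure that combines the two sets of frequencies would fit the claim language equally well.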
21. A relevant document search method according to
claim 20, further comprising the step of: 

registering the document in the text database, wherein the 
document registering step includes the steps of: 
detecting an inter-word boundary from a text in the
document and extracting a string which is a candi- 
date of a word starting or ending at the inter-word 
boundary from the text; 
extracting all n-grams, wherein n is equal to or larger 
than 1 and equal to or smaller than m, and where m 
is the length of the extracted string, from the 
extracted string; and 
storing a pair of an identification number of the docu- 
ment and an occurrence frequency of the n-gram in 
the text into an occurrence frequency file with the 
n-gram, 

wherein said occurrence frequency obtaining step 
includes the step of: 

referring to said occurrence frequency file to obtain 
the occurrence frequency of the characteristic 
string in each document. 
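The registration and lookup steps of claim 21 can be sketched as follows. The in-memory dictionary standing in for the occurrence frequency file, and the names `all_ngrams`, `register_document`, and `lookup`, are assumptions for illustration; the claim does not specify a storage format.

```python
from collections import Counter, defaultdict


def all_ngrams(s):
    """All n-grams of s for 1 <= n <= m, where m is the length of s."""
    return [s[i:i + n] for n in range(1, len(s) + 1)
            for i in range(len(s) - n + 1)]


def register_document(index, doc_id, candidate_strings):
    """Store (document id, occurrence frequency) pairs under each n-gram.

    `index` is a defaultdict(list) used here in place of the occurrence
    frequency file.
    """
    counts = Counter()
    for s in candidate_strings:
        counts.update(all_ngrams(s))
    for gram, freq in counts.items():
        index[gram].append((doc_id, freq))


def lookup(index, gram):
    """Occurrence frequency of `gram` in each registered document."""
    return dict(index.get(gram, []))
```

For example, after `index = defaultdict(list)`, registering document 1 with the extracted string `"abc"` and document 2 with the strings `"ab"` and `"b"` makes `lookup(index, "b")` report one occurrence in document 1 and two in document 2.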

22. A relevant document search method according to 
claim 21, wherein said inter-word boundary is a character
set boundary.